Abstract

The AdaBoost algorithm was designed to combine many "weak" hypotheses that perform
slightly better than random guessing into a "strong" hypothesis that has very low error. We
study the rate at which AdaBoost iteratively converges to the minimum of the "exponential
loss." Unlike previous work, our proofs do not require a weak-learning assumption, nor do
they require that minimizers of the exponential loss are finite. Our first result shows that
the exponential loss of AdaBoost's computed parameter vector will be at most $\epsilon$ more
than that of any parameter vector of $\ell_1$-norm bounded by $B$ in a number of rounds that
is at most a polynomial in $B$ and $1/\epsilon$. We also provide lower bounds showing
that a polynomial dependence on these parameters is necessary. Our second result is that
within $C/\epsilon$ iterations, AdaBoost achieves a value of the exponential loss that is at most
$\epsilon$ more than the best possible value, where $C$ depends on the dataset. We show that this
dependence of the rate on $\epsilon$ is optimal up to constant factors, i.e., at least
$\Omega(1/\epsilon)$ rounds are necessary to achieve within $\epsilon$ of the optimal exponential loss.

Keywords: AdaBoost, optimization, coordinate descent, convergence rate. 



1. Introduction 



The AdaBoost algorithm of Freund and Schapire (1997) was designed to combine many
"weak" hypotheses that perform slightly better than random guessing into a "strong"
hypothesis that has very low error. Despite extensive theoretical and empirical study, basic
properties of AdaBoost's convergence are not fully understood. In this work, we focus on
one of those properties, namely, to find convergence rates that hold in the absence of any
simplifying assumptions. Such assumptions, relied upon in much of the preceding work,
make it easier to prove a fast convergence rate for AdaBoost, but often do not hold in the
cases where AdaBoost is commonly applied.

AdaBoost can be viewed as a coordinate descent (or functional gradient descent) algorithm
that iteratively minimizes an objective function called the exponential loss (Breiman, 1999;
Frean and Downs, 1998; Friedman et al., 2000; Friedman, 2001; Mason et al., 2000;
Onoda et al., 1998; Rätsch et al., 2001; Schapire and Singer, 1999). Given
$m$ labeled training examples $(x_1, y_1), \ldots, (x_m, y_m)$, where the $x_i$'s are in some domain
$\mathcal{X}$ and $y_i \in \{-1,+1\}$, and a finite (but typically very large) space of weak hypotheses
$\mathcal{H} = \{h_1, \ldots, h_N\}$, where each $h_j : \mathcal{X} \to \{-1,+1\}$, the exponential loss is defined as

$$L(\lambda) = \frac{1}{m} \sum_{i=1}^{m} \exp\left( -\sum_{j=1}^{N} \lambda_j y_i h_j(x_i) \right),$$



where $\lambda = (\lambda_1, \ldots, \lambda_N)$ is a vector of weights or parameters. In each iteration, a coordinate
descent algorithm moves some distance along some coordinate direction $\lambda_j$. For AdaBoost,
the coordinate directions correspond to the individual weak hypotheses. Thus, on each
round, AdaBoost chooses some weak hypothesis and step length, and adds these to the
current weighted combination of weak hypotheses, which is equivalent to updating a single
weight. The direction and step length are so chosen that the resulting vector $\lambda^t$ in iteration
$t$ yields a lower value of the exponential loss than in the previous iteration, $L(\lambda^t) < L(\lambda^{t-1})$.
This repeats until it reaches a minimizer if one exists. It was shown by Collins et al. (2002),
and later by Zhang and Yu (2005), that AdaBoost asymptotically converges to the minimum
possible exponential loss. That is,

$$\lim_{t \to \infty} L(\lambda^t) = \inf_{\lambda} L(\lambda).$$

However, that work did not address a convergence rate to the minimizer of the exponential
loss.

Our work specifically addresses a recent conjecture of Schapire (2010) stating that there
exists a positive constant $c$ and a polynomial poly() such that for all training sets and all
finite sets of weak hypotheses, and for all $B > 0$,

$$L(\lambda^t) \le \min_{\lambda : \|\lambda\|_1 \le B} L(\lambda) + \frac{\mathrm{poly}(\log N, m, B)}{t^c}. \qquad (1)$$

In other words, the exponential loss of AdaBoost will be at most $\epsilon$ more than that of any
other parameter vector $\lambda$ of $\ell_1$-norm bounded by $B$ in a number of rounds that is bounded
by a polynomial in $\log N$, $m$, $B$ and $1/\epsilon$. (We require $\log N$ rather than $N$ since the number
of weak hypotheses will typically be extremely large.) Along with an upper bound that is
polynomial in these parameters, we also provide lower bound constructions showing some
polynomial dependence on $B$ and $1/\epsilon$ is necessary. Without any additional assumptions on
the exponential loss $L$, and without altering AdaBoost's minimization algorithm for $L$, the
best known convergence rate of AdaBoost prior to this work that we are aware of is that of
Bickel et al. (2006), who prove a bound on the rate of the form $O(1/\sqrt{\log t})$.

We also provide a convergence rate of AdaBoost to the minimum value of the exponential
loss. Namely, within $C/\epsilon$ iterations, AdaBoost achieves a value of the exponential loss that
is at most $\epsilon$ more than the best possible value, where $C$ depends on the dataset. This
convergence rate is different from the one discussed above in that it has better dependence
on $\epsilon$ (in fact the dependence is optimal, as we show), and does not depend on the best
solution within a ball of size $B$. However, this second convergence rate cannot be used to
prove (1) since in certain worst case situations, we show the constant $C$ may be larger than
$2^m$ (although usually it will be much smaller).

Within the proof of the second convergence rate, we provide a lemma (called the de- 
composition lemma) that shows that the training set can be split into two sets of examples: 
the "finite margin set," and the "zero loss set." Examples in the finite margin set always 
make a positive contribution to the exponential loss, and they never lie too far from the 
decision boundary. Examples in the zero loss set do not have these properties. If we con- 
sider the exponential loss where the sum is only over the finite margin set (rather than over 
all training examples), it is minimized by a finite A. The fact that the training set can be 
decomposed into these two classes is the key step in proving the second convergence rate. 

This problem of determining the rate of convergence is relevant in the proof of the
consistency of AdaBoost given by Bartlett and Traskin (2007), where it has a direct impact
on the rate at which AdaBoost converges to the Bayes optimal classifier (under suitable
assumptions). It may also be relevant to practitioners who wish to have a guarantee on the 
exponential loss value at iteration t (although, in general, minimization of the exponential 
loss need not be perfectly correlated with test accuracy). 



There have been several works that make additional assumptions on the exponential
loss in order to attain a better bound on the rate, but those assumptions are not true
in general, and cases are known where each of these assumptions is violated. For
instance, better bounds are proved by Rätsch et al. (2002) using results from Luo and Tseng
(1992), but these appear to require that the exponential loss be minimized by a finite
$\lambda$, and also depend on quantities that are not easily measured. There are many cases
where $L$ does not have a finite minimizer; in fact, one such case is provided by Schapire
(2010). Shalev-Shwartz and Singer (2008) have proven bounds for a variant of AdaBoost.
Zhang and Yu (2005) also have given rates of convergence, but their technique requires
a bound on the change in the size of $\lambda^t$ at each iteration that does not necessarily hold
for AdaBoost. Many classic results are known on the convergence of iterative algorithms
generally (see for instance Luenberger and Ye, 2008; Boyd and Vandenberghe, 2004);
however, these typically start by assuming that the minimum is attained at some finite point
in the (usually compact) space of interest, assumptions that do not generally hold in our
setting. When the weak learning assumption holds, there is a parameter $\gamma > 0$ that governs
the improvement of the exponential loss at each iteration. Freund and Schapire (1997) and
Schapire and Singer (1999) showed that the exponential loss is at most $e^{-2t\gamma^2}$ after $t$ rounds,
so AdaBoost rapidly converges to the minimum possible loss under this assumption.

In Section 2 we summarize the coordinate descent view of AdaBoost. Section 3 contains
the proof of the conjecture, with associated lower bounds proved in Section 3.3. Section 4
provides the $C/\epsilon$ convergence rate. The proof of the decomposition lemma is given in
Section 4.2.

Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in \mathcal{X}$, $y_i \in \{-1,+1\}$
       set $\mathcal{H} = \{h_1, \ldots, h_N\}$ of weak hypotheses $h_j : \mathcal{X} \to \{-1,+1\}$.
Initialize: $D_1(i) = 1/m$ for $i = 1, \ldots, m$.
For $t = 1, \ldots, T$:

• Train weak learner using distribution $D_t$; that is, find weak hypothesis $h_t \in \mathcal{H}$ whose
  correlation $r_t = \mathbb{E}_{i \sim D_t}[y_i h_t(x_i)]$ has maximum magnitude $|r_t|$.

• Choose $\alpha_t = \frac{1}{2} \ln\{(1 + r_t)/(1 - r_t)\}$.

• Update, for $i = 1, \ldots, m$: $D_{t+1}(i) = D_t(i) \exp(-\alpha_t y_i h_t(x_i)) / Z_t$,
  where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).

Output the final hypothesis: $F(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.

Figure 1: The boosting algorithm AdaBoost.
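
The procedure in Figure 1 can be sketched in a few lines of Python. This is a minimal
illustration under stated assumptions, not the authors' implementation: the weak hypotheses
are assumed to be given explicitly as the columns of a small matrix `M` with entries
`M[i][j] = y_i * h_j(x_i)` in {-1, +1}, and the names `adaboost`, `predict`, `M` and `T` are
hypothetical. The sketch stops early when the chosen correlation has magnitude 1 or 0, since
the step in Figure 1 is then infinite or zero.

```python
import math

def adaboost(M, T):
    """Sketch of Figure 1 on an explicit matrix M[i][j] = y_i * h_j(x_i) in {-1, +1}.

    Returns (lam, D): the accumulated weight on each hypothesis and the final distribution.
    """
    m, N = len(M), len(M[0])
    D = [1.0 / m] * m                      # D_1(i) = 1/m
    lam = [0.0] * N
    for _ in range(T):
        # Correlation of every hypothesis under the current distribution.
        r = [sum(D[i] * M[i][j] for i in range(m)) for j in range(N)]
        j = max(range(N), key=lambda k: abs(r[k]))
        if abs(r[j]) >= 1.0 or r[j] == 0.0:
            break                          # infinite or zero step (assumption of this sketch)
        alpha = 0.5 * math.log((1 + r[j]) / (1 - r[j]))
        lam[j] += alpha
        # Reweight the examples and renormalize by Z_t.
        w = [D[i] * math.exp(-alpha * M[i][j]) for i in range(m)]
        Z = sum(w)
        D = [wi / Z for wi in w]
    return lam, D

def predict(lam, h_values):
    """Final hypothesis F(x) = sign(sum_j lam_j * h_j(x)), given the values h_j(x)."""
    score = sum(l * h for l, h in zip(lam, h_values))
    return 1 if score >= 0 else -1

# Toy data: 4 examples, 3 weak hypotheses (hypothetical values).
M = [[1, -1, 1], [1, 1, -1], [-1, 1, 1], [1, 1, 1]]
lam, D = adaboost(M, T=10)
```

On this toy matrix every example ends up with a positive margin, so the combined hypothesis
classifies all four training examples correctly.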



2. Coordinate Descent View of AdaBoost 



From the examples $(x_1, y_1), \ldots, (x_m, y_m)$ and hypotheses $\mathcal{H} = \{h_1, \ldots, h_N\}$, AdaBoost
iteratively computes the function $F : \mathcal{X} \to \mathbb{R}$, where $\mathrm{sign}(F(x))$ can be used as a classifier for a
new instance $x$. The function $F$ is a linear combination of the hypotheses. At each iteration
$t$, AdaBoost chooses one of the weak hypotheses $h_t$ from the set $\mathcal{H}$, and adjusts its coefficient
by a specified value $\alpha_t$. Then $F$ is constructed after $T$ iterations as: $F(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$.
Figure 1 shows the AdaBoost algorithm (Freund and Schapire, 1997).

Since each $h_t$ is equal to $h_{j_t}$ for some $j_t$, $F$ can also be written $F(x) = \sum_{j=1}^{N} \lambda_j h_j(x)$
for a vector of values $\lambda = (\lambda_1, \ldots, \lambda_N)$ (such vectors will sometimes also be referred to as
combinations, since they represent combinations of weak hypotheses). In different notation,
we can write AdaBoost as a coordinate descent algorithm on vector $\lambda$. We define the
feature matrix $M$ elementwise by $M_{ij} = y_i h_j(x_i)$, so that this matrix contains all of the
inputs to AdaBoost (the training examples and hypotheses). Then the exponential loss can
be written more compactly as:

$$L(\lambda) = \frac{1}{m} \sum_{i=1}^{m} e^{-(M\lambda)_i},$$

where $(M\lambda)_i$, the $i$th coordinate of the vector $M\lambda$, is the (unnormalized) margin achieved
by vector $\lambda$ on training example $i$.

Coordinate descent algorithms choose a coordinate at each iteration where the directional
derivative is the steepest, and choose a step that maximally decreases the objective
along that coordinate. To perform coordinate descent on the exponential loss, we determine
the coordinate $j_t$ at iteration $t$ as follows, where $e_j$ is a vector that is 1 in the $j$th position
and 0 elsewhere:

$$j_t \in \arg\max_j \left| \left. -\frac{dL(\lambda^{t-1} + \alpha e_j)}{d\alpha} \right|_{\alpha=0} \right| = \arg\max_j \frac{1}{m} \left| \sum_{i=1}^{m} e^{-(M\lambda^{t-1})_i} M_{ij} \right|. \qquad (2)$$



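We can show that this is equivalent to the weak learning step of AdaBoost. Unraveling the
recursion in Figure 1 for AdaBoost's weight vector $D_t$, we can see that $D_t(i)$ is proportional
to

$$\exp\left( -\sum_{t' < t} \alpha_{t'} y_i h_{t'}(x_i) \right).$$

The term in the exponent can also be rewritten in terms of the vector $\lambda^t$, where $\lambda^t_j$ is the
sum of $\alpha_t$'s where hypothesis $h_j$ was chosen: $\sum_{t' < t} \alpha_{t'} \mathbf{1}[h_{t'} = h_j] = \lambda^{t-1}_j$. The term in the
exponent is:

$$\sum_{t' < t} \alpha_{t'} y_i h_{t'}(x_i) = \sum_j \sum_{t' < t} \alpha_{t'} \mathbf{1}[h_{t'} = h_j]\, y_i h_j(x_i) = \sum_j \lambda^{t-1}_j M_{ij} = (M\lambda^{t-1})_i,$$

where $(\cdot)_i$ denotes the $i$th component of a vector. This means $D_t(i)$ is proportional to
$e^{-(M\lambda^{t-1})_i}$. Eq. (2) can now be rewritten as

$$j_t \in \arg\max_j \left| \sum_i e^{-(M\lambda^{t-1})_i} M_{ij} \right| = \arg\max_j \left| \sum_i D_t(i) M_{ij} \right| = \arg\max_j \left| \mathbb{E}_{i \sim D_t}[M_{ij}] \right| = \arg\max_j \left| \mathbb{E}_{i \sim D_t}[y_i h_j(x_i)] \right|,$$

which is exactly the way AdaBoost chooses a weak hypothesis in each round (see Figure 1).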
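The correlation $\sum_i D_t(i) M_{ij_t}$ will be denoted by $r_t$ and its absolute value $|r_t|$ denoted by
$\delta_t$. The quantity $\delta_t$ is commonly called the edge for round $t$. The distance $\alpha_t$ to travel along
direction $j_t$ is found for coordinate descent via a linesearch (see for instance Mason et al.,
2000):

$$0 = -\frac{dL(\lambda^{t-1} + \alpha_t e_{j_t})}{d\alpha_t} = \frac{1}{m} \sum_i e^{-(M(\lambda^{t-1} + \alpha_t e_{j_t}))_i} M_{ij_t},$$

and dividing both sides by the normalization factor,

$$0 = \sum_{i : M_{ij_t} = 1} D_t(i) e^{-\alpha_t} - \sum_{i : M_{ij_t} = -1} D_t(i) e^{\alpha_t} = \frac{1 + r_t}{2} e^{-\alpha_t} - \frac{1 - r_t}{2} e^{\alpha_t}, \quad\text{so}\quad \alpha_t = \frac{1}{2} \ln\frac{1 + r_t}{1 - r_t},$$

just as in Figure 1. Thus, AdaBoost is equivalent to coordinate descent on $L(\lambda)$. With this
choice of step length, it can be shown (Freund and Schapire, 1997) that the exponential loss
drops by an amount depending on the edge:

$$L(\lambda^t) = \frac{1}{2}\left( (1 + r_t) e^{-\alpha_t} + (1 - r_t) e^{\alpha_t} \right) L(\lambda^{t-1}) = \sqrt{(1 + r_t)(1 - r_t)}\, L(\lambda^{t-1}) = \sqrt{1 - r_t^2}\, L(\lambda^{t-1}) = \sqrt{1 - \delta_t^2}\, L(\lambda^{t-1}).$$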
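
The equivalence between the line search and the per-round loss drop can be checked
numerically. The following is a small sketch, assuming a feature matrix with entries in
{-1, +1} supplied as a Python list of rows; the function names and the toy matrix are
hypothetical. It performs exact coordinate descent steps on the exponential loss and records,
for each round, the observed ratio $L(\lambda^t)/L(\lambda^{t-1})$ next to $\sqrt{1 - r_t^2}$.

```python
import math

def exp_loss(M, lam):
    """L(lam) = (1/m) * sum_i exp(-(M lam)_i)."""
    return sum(math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M) / len(M)

def coordinate_descent_step(M, lam):
    """One exact coordinate-descent step on the exponential loss; returns (new_lam, r_t)."""
    m, N = len(M), len(lam)
    # Weights proportional to exp(-(M lam)_i), normalized into the distribution D_t.
    w = [math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M]
    total = sum(w)
    D = [wi / total for wi in w]
    # Steepest coordinate: largest absolute correlation under D_t.
    r = [sum(D[i] * M[i][j] for i in range(m)) for j in range(N)]
    j = max(range(N), key=lambda k: abs(r[k]))
    alpha = 0.5 * math.log((1 + r[j]) / (1 - r[j]))   # closed-form line search
    new_lam = list(lam)
    new_lam[j] += alpha
    return new_lam, r[j]

M = [[1, -1, 1], [1, 1, -1], [-1, 1, 1], [1, 1, 1]]
lam = [0.0, 0.0, 0.0]
ratios = []
for _ in range(5):
    before = exp_loss(M, lam)
    lam, r = coordinate_descent_step(M, lam)
    ratios.append((exp_loss(M, lam) / before, math.sqrt(1 - r * r)))
```

Each pair in `ratios` agrees up to floating-point error, which is the identity stated above for
{-1, +1}-valued hypotheses.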



Our rate bounds also hold when the weak hypotheses are confidence-rated, that is, giving
real-valued predictions in $[-1,+1]$, so that $h : \mathcal{X} \to [-1,+1]$. In that case, the criterion
for picking a weak hypothesis in each round remains the same, that is, at round $t$, an $h_{j_t}$
maximizing the absolute correlation, $j_t \in \arg\max_j \left| \sum_{i=1}^{m} e^{-(M\lambda^{t-1})_i} M_{ij} \right|$, is chosen, where
$M_{ij}$ may now be non-integral. An exact analytical line search is no longer possible, but if
the step size is chosen in the same way,

$$\alpha_t = \frac{1}{2} \ln\frac{1 + r_t}{1 - r_t}, \qquad (3)$$

then Freund and Schapire (1997) and Schapire and Singer (1999) show that a similar drop
in the loss is still guaranteed:

$$L(\lambda^t) \le L(\lambda^{t-1}) \sqrt{1 - \delta_t^2}. \qquad (4)$$



With confidence-rated hypotheses, other implementations may choose the step size in a
different way. However, in this paper, by "AdaBoost" we will always mean the version in
(Freund and Schapire, 1997; Schapire and Singer, 1999) which chooses step sizes as in (3),
and enjoys the loss guarantee as in (4). That said, all our proofs work more generally,
and are robust to numerical inaccuracies in the implementation. In other words, even if 
the previous conditions are violated by a small amount, similar bounds continue to hold, 
although we leave out explicit proofs of this fact to simplify the presentation. 



3. First convergence rate: Convergence to any target loss 

In this section, we bound the number of rounds of AdaBoost required to get within $\epsilon$ of the
loss attained by a parameter vector $\lambda^*$ as a function of $\epsilon$ and the $\ell_1$-norm $\|\lambda^*\|_1$. The vector
$\lambda^*$ serves as a reference based on which we define the target loss $L(\lambda^*)$, and we will show
that its $\ell_1$-norm measures the difficulty of attaining the target loss in a specific sense. We
prove a bound polynomial in $1/\epsilon$, $\|\lambda^*\|_1$ and the number of examples $m$, showing (1) holds,
thereby resolving affirmatively the open problem posed in (Schapire, 2010). Later in the
section we provide lower bounds showing how a polynomial dependence on both parameters
is necessary.



3.1 Upper Bound 

The main result of this section is the following rate upper bound. 

Theorem 1 For any $\lambda^* \in \mathbb{R}^N$, AdaBoost achieves loss at most $L(\lambda^*) + \epsilon$ in at most
$13 \|\lambda^*\|_1^6 \epsilon^{-5}$ rounds.

The high level idea behind the proof of the theorem is as follows. To show a fast rate, we
require a large edge in each round, as indicated by (4). A large edge is guaranteed if the
size of the current solution of AdaBoost is small. Therefore AdaBoost makes good progress 
if the size of its solution does not grow too fast. On the other hand, the increase in size of 
its solution is given by the step length, which in turn is proportional to the edge achieved 
in that round. Therefore, if the solution size grows fast, the loss also drops fast. Either way 
the algorithm makes good progress. In the rest of the section we make these ideas concrete 
through a sequence of lemmas. 

We provide some more notation. Throughout, $\lambda^*$ is fixed, and its $\ell_1$-norm is denoted
by $B$ (matching the notation in Schapire, 2010). One key parameter is the suboptimality
$R_t$ of AdaBoost's solution measured via the logarithm of the exponential loss:

$$R_t = \ln L(\lambda^t) - \ln L(\lambda^*).$$

Another key parameter is the $\ell_1$-distance $S_t$ of AdaBoost's solution from the closest
combination that achieves the target loss:

$$S_t = \inf_{\lambda} \left\{ \|\lambda - \lambda^t\|_1 : L(\lambda) \le L(\lambda^*) \right\}.$$

We will also be interested in how they change as captured by

$$\Delta R_t = R_{t-1} - R_t, \qquad \Delta S_t = S_t - S_{t-1}.$$

Notice that $\Delta R_t$ is always non-negative since AdaBoost decreases the loss, and hence the
suboptimality, in each round. Let $T_0$ be the bound on the number of rounds in Theorem 1.
We assume without loss of generality that $R_0, \ldots, R_{T_0}$ and $S_0, \ldots, S_{T_0}$ are all strictly
positive, since otherwise the theorem holds trivially. Also, in the rest of the section, we restrict
our attention entirely to the first $T_0$ rounds of boosting. We first show that a poly$(B, \epsilon^{-1})$
rate of convergence follows if the edge is always polynomially large compared to the
suboptimality.

Lemma 2 If for some constants $c_1, c_2$, where $c_2 > 1/2$, the edge satisfies $\delta_t \ge B^{-c_1} R_{t-1}^{c_2}$ in
each round $t$, then AdaBoost achieves at most $L(\lambda^*) + \epsilon$ loss after $2B^{2c_1} (\epsilon \ln 2)^{1 - 2c_2}$ rounds.

Proof From the definition of $R_t$ and (4) we have

$$\Delta R_t = \ln L(\lambda^{t-1}) - \ln L(\lambda^t) \ge -\tfrac{1}{2} \ln(1 - \delta_t^2). \qquad (5)$$

Combining the above with the inequality $e^x \ge 1 + x$, and the assumption on the edge,

$$\Delta R_t \ge -\tfrac{1}{2} \ln(1 - \delta_t^2) \ge \tfrac{1}{2} \delta_t^2 \ge \tfrac{1}{2} B^{-2c_1} R_{t-1}^{2c_2}.$$

Let $T = \lceil 2B^{2c_1} (\epsilon \ln 2)^{1 - 2c_2} \rceil$ be the bound on the number of rounds in the lemma. If any of
$R_0, \ldots, R_T$ is negative, then by monotonicity $R_T < 0$ and we are done. Otherwise, they are
all non-negative. Then, applying Lemma 32 from the Appendix to the sequence $R_0, \ldots, R_T$,
and using $c_2 > 1/2$ we get

$$R_T^{1 - 2c_2} \ge R_0^{1 - 2c_2} + c_2 B^{-2c_1} T > (1/2) B^{-2c_1} T \ge (\epsilon \ln 2)^{1 - 2c_2} \implies R_T < \epsilon \ln 2.$$

If either $\epsilon$ or $L(\lambda^*)$ is greater than 1, then the lemma follows since $L(\lambda^T) \le L(\lambda^0) = 1 <
L(\lambda^*) + \epsilon$. Otherwise,

$$L(\lambda^T) < L(\lambda^*) e^{\epsilon \ln 2} \le L(\lambda^*)(1 + \epsilon) \le L(\lambda^*) + \epsilon,$$

where the second inequality uses $e^x \le 1 + (1/\ln 2) x$ for $x \in [0, \ln 2]$. ■

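We next show that large edges are achieved provided $S_t$ is small compared to $R_t$.

Lemma 3 In each round $t$, the edge satisfies $\delta_t \ge R_{t-1}/S_{t-1}$.

Proof For any combination $\lambda$, define $D_\lambda$ as the distribution on examples $\{1, \ldots, m\}$ that
puts weight proportional to the loss: $D_\lambda(i) = e^{-(M\lambda)_i}/(mL(\lambda))$. Choose any $\lambda$ suffering at
most the target loss, $L(\lambda) \le L(\lambda^*)$. By non-negativity of relative entropy we get

$$0 \le \mathrm{RE}(D_{\lambda^{t-1}} \,\|\, D_\lambda) = \sum_{i=1}^{m} D_{\lambda^{t-1}}(i) \ln\left( \frac{\frac{1}{m} e^{-(M\lambda^{t-1})_i}/L(\lambda^{t-1})}{\frac{1}{m} e^{-(M\lambda)_i}/L(\lambda)} \right) \le -R_{t-1} + \sum_{i=1}^{m} D_{\lambda^{t-1}}(i) \left( M\lambda - M\lambda^{t-1} \right)_i. \qquad (6)$$

Note that $D_{\lambda^{t-1}}$ is the distribution $D_t$ that AdaBoost creates in round $t$. The above
summation can be rewritten as

$$\sum_{i=1}^{m} D_{\lambda^{t-1}}(i) \sum_{j=1}^{N} (\lambda_j - \lambda^{t-1}_j) M_{ij} = \sum_{j=1}^{N} (\lambda_j - \lambda^{t-1}_j) \sum_{i=1}^{m} D_t(i) M_{ij} \le \left( \max_j \left| \sum_{i=1}^{m} D_t(i) M_{ij} \right| \right) \sum_{j=1}^{N} |\lambda_j - \lambda^{t-1}_j| = \delta_t \|\lambda - \lambda^{t-1}\|_1. \qquad (7)$$

Since the previous holds for any $\lambda$ suffering at most the target loss, the summation in (6) is
at most $\delta_t S_{t-1}$. Combining this with (6) completes the proof. ■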
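
The inequality behind Lemma 3 can be observed numerically. The sketch below is illustrative
only: the toy matrix, the reference vector `lam_star` and the helper names are hypothetical.
Since the distance $S_{t-1}$ is an infimum that is awkward to compute, the code instead uses
the distance from $\lambda^{t-1}$ to the reference vector itself, which is at least $S_{t-1}$; the
proof above shows $\delta_t \|\lambda^* - \lambda^{t-1}\|_1 \ge R_{t-1}$, and that is what is checked.

```python
import math

def exp_loss(M, lam):
    """L(lam) = (1/m) * sum_i exp(-(M lam)_i)."""
    return sum(math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M) / len(M)

def best_column(M, lam):
    """Return (j_t, r_t): the column with the largest |correlation| under D_t."""
    w = [math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M]
    total = sum(w)
    r = [sum(w[i] * M[i][j] for i in range(len(M))) / total for j in range(len(lam))]
    j = max(range(len(lam)), key=lambda k: abs(r[k]))
    return j, r[j]

def edge_versus_bound(M, lam_star, rounds):
    """For each round, record (edge, R_{t-1} / ||lam_star - lam^{t-1}||_1)."""
    lam = [0.0] * len(lam_star)
    target = math.log(exp_loss(M, lam_star))
    out = []
    for _ in range(rounds):
        R_prev = math.log(exp_loss(M, lam)) - target
        dist = sum(abs(a - b) for a, b in zip(lam_star, lam))
        j, r = best_column(M, lam)
        out.append((abs(r), R_prev / dist))
        lam[j] += 0.5 * math.log((1 + r) / (1 - r))   # AdaBoost step
    return out

M = [[1, -1, 1], [1, 1, -1], [-1, 1, 1], [1, 1, 1]]
pairs = edge_versus_bound(M, lam_star=[0.5, 0.5, 0.5], rounds=6)
```

In every recorded round the edge is at least the second entry of the pair. Once AdaBoost's loss
drops below the target loss, $R_{t-1}$ becomes negative and the inequality holds trivially.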

To complete the proof of Theorem 1 we show $S_t$ is small compared to $R_t$ in rounds $t \le T_0$
(during which we have assumed $S_t, R_t$ are all positive). In fact we prove:

Lemma 4 For any $t \le T_0$, $S_t \le B^3 R_t^{-2}$.

This, along with Lemmas 2 and 3, immediately proves Theorem 1. The bound on $S_t$ in
Lemma 4 can be proven if we can first show $S_t$ grows slowly compared to the rate at which
the suboptimality $R_t$ falls. Intuitively this holds since growth in $S_t$ is caused by a large
step, which in turn will drive down the suboptimality. In fact we can prove the following.
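
To make the constants explicit, here is a short worked derivation of how the pieces combine
into Theorem 1. It assumes the reading of the statements used in this section, namely
$\delta_t \ge R_{t-1}/S_{t-1}$ (Lemma 3), $S_t \le B^3 R_t^{-2}$ (Lemma 4), and the round bound
$2B^{2c_1}(\epsilon \ln 2)^{1-2c_2}$ of Lemma 2.

```latex
% Lemma 3 and Lemma 4 give the edge condition required by Lemma 2:
\delta_t \;\ge\; \frac{R_{t-1}}{S_{t-1}}
         \;\ge\; \frac{R_{t-1}}{B^{3} R_{t-1}^{-2}}
         \;=\; B^{-3} R_{t-1}^{3},
\qquad\text{so } c_1 = 3,\; c_2 = 3 .

% Plugging c_1 = c_2 = 3 into the round bound of Lemma 2:
2 B^{2 c_1} (\epsilon \ln 2)^{1 - 2 c_2}
  \;=\; \frac{2}{(\ln 2)^{5}}\, B^{6} \epsilon^{-5}
  \;\approx\; 12.5\, B^{6} \epsilon^{-5}
  \;\le\; 13\, B^{6} \epsilon^{-5}.
```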



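Lemma 5 In any round $t \le T_0$, we have $\frac{2 \Delta R_t}{R_{t-1}} \ge \frac{\Delta S_t}{S_{t-1}}$.

Proof Firstly, it follows from the definition of $S_t$ that $\Delta S_t \le \|\lambda^t - \lambda^{t-1}\|_1 = |\alpha_t|$. Next,
using (5) and (3) we may write $\Delta R_t \ge \Upsilon(\delta_t) |\alpha_t|$, where the function $\Upsilon$ has been defined in
(Rätsch and Warmuth, 2005) as

$$\Upsilon(x) = \frac{-\ln(1 - x^2)}{\ln\left(\frac{1 + x}{1 - x}\right)}.$$

It is known (Rätsch and Warmuth, 2005; Rudin et al., 2007) that $\Upsilon(x) \ge x/2$ for $x \in [0, 1]$.
Combining and using Lemma 3,

$$\Delta R_t \ge \delta_t \Delta S_t / 2 \ge R_{t-1} \left( \Delta S_t / (2 S_{t-1}) \right).$$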

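Rearranging completes the proof. ■

Using this we may prove Lemma 4.

Proof We first show $S_0 \le B^3 R_0^{-2}$. Note, $S_0 \le \|\lambda^* - \lambda^0\|_1 = B$, and by definition the
quantity $R_0 = -\ln\left( \frac{1}{m} \sum_i e^{-(M\lambda^*)_i} \right)$. The quantity $(M\lambda^*)_i$ is the inner product of row $i$
of matrix $M$ with the vector $\lambda^*$. Since the entries of $M$ lie in $[-1,+1]$, this is at most
$\|\lambda^*\|_1 = B$. Therefore $R_0 \le -\ln\left( \frac{1}{m} \sum_i e^{-B} \right) = B$, which is what we needed.

To complete the proof, we show that $R_t^2 S_t$ is non-increasing. It suffices to show for any
$t$ the inequality $R_t^2 S_t \le R_{t-1}^2 S_{t-1}$. This holds by the following chain:

$$R_t^2 S_t = (R_{t-1} - \Delta R_t)^2 (S_{t-1} + \Delta S_t) = R_{t-1}^2 S_{t-1} \left( 1 - \frac{\Delta R_t}{R_{t-1}} \right)^2 \left( 1 + \frac{\Delta S_t}{S_{t-1}} \right) \le R_{t-1}^2 S_{t-1} \exp\left( -\frac{2 \Delta R_t}{R_{t-1}} + \frac{\Delta S_t}{S_{t-1}} \right) \le R_{t-1}^2 S_{t-1},$$

where the first inequality follows from $e^x \ge 1 + x$, and the second one from Lemma 5. ■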

This completes the proof of Theorem 1. Although our bound provides a rate polynomial
in $B, \epsilon^{-1}$ as desired by the conjecture in (Schapire, 2010), the exponents are rather large,
and (we believe) not tight. One possible source of slack is the bound on $S_t$ in Lemma 4.
Qualitatively, the distance $S_t$ to some solution having target loss should decrease with
rounds, whereas Lemma 4 only says it does not increase too fast. Improving this will
directly lead to a faster convergence rate. In particular, showing that $S_t$ never increases
would imply a $B^2/\epsilon$ rate of convergence. Whether or not the monotonicity of $S_t$ holds, we
believe that the obtained rate bound is probably true, and state it as a conjecture.

Conjecture 6 For any $\lambda^*$ and $\epsilon > 0$, AdaBoost converges to within $L(\lambda^*) + \epsilon$ loss in
$O(B^2/\epsilon)$ rounds, where the order notation hides only absolute constants.

As evidence supporting the conjecture, we show in the next section how a minor modification
to AdaBoost can achieve the above rate.



3.2 Faster rates for a variant 

In this section we introduce a new algorithm, AdaBoost.S, which will enjoy the much faster
rate of convergence mentioned in Conjecture 6. AdaBoost.S is the same as AdaBoost, except
that at the end of each round, the current combination of weak hypotheses is scaled back,
that is, multiplied by a scalar in $[0, 1]$ if doing so will reduce the exponential loss further. The
code is largely the same as in Section 2, maintaining a combination $\lambda^{t-1}$ of weak hypotheses,
and greedily choosing $\alpha_t$ and $h_{j_t}$ on each round to form a new combination $\tilde{\lambda}^t = \lambda^{t-1} + \alpha_t e_{j_t}$.
However, after creating the new combination $\tilde{\lambda}^t$, the result is multiplied by the value $s_t$ in
$[0,1]$ that causes the greatest decrease in the exponential loss: $s_t = \arg\min_s L(s\tilde{\lambda}^t)$, and
$\lambda^t = s_t \tilde{\lambda}^t$. Since $L(s\tilde{\lambda}^t)$, as a function of $s$, is convex, its minimum on $[0, 1]$ can be found
easily, for instance, using a simple binary search. The new distribution $D_{t+1}$ on the examples
is constructed using $\lambda^t$ as before; the weight $D_{t+1}(i)$ on example $i$ is proportional to its
exponential loss, $D_{t+1}(i) \propto e^{-(M\lambda^t)_i}$. With this modification we may prove the following:
Theorem 7 For any $\lambda^*$, $\epsilon > 0$, AdaBoost.S achieves at most $L(\lambda^*) + \epsilon$ loss within $3\|\lambda^*\|_1^2/\epsilon$
rounds.

The proof is similar to that in the previous section. Reusing the same notation, note that
the proof of Lemma 2 continues to hold (with very minor, straightforward modifications).
Next we can exploit the changes in AdaBoost.S to show an improved version of
Lemma 3. Intuitively, scaling back has the effect of preventing the weights on the weak
hypotheses from becoming "too large", and we may show

Lemma 8 In each round $t$, the edge satisfies $\delta_t \ge R_{t-1}/B$.

Proof We will reuse parts of the proof of Lemma 3. Setting $\lambda = \lambda^*$ in (6) we may write

$$R_{t-1} \le \sum_{i=1}^{m} D_{\lambda^{t-1}}(i) (M\lambda^*)_i + \sum_{i=1}^{m} -D_{\lambda^{t-1}}(i) (M\lambda^{t-1})_i.$$

The first summation can be upper bounded as in (7) by $\delta_t \|\lambda^*\|_1 = \delta_t B$. We will next show
that the second summation is non-positive, which will complete the proof. The scaling step
was added just so that this last fact would be true.

If we define $G : [0, 1] \to \mathbb{R}$ to be $G(s) = L(s\tilde{\lambda}^{t-1}) = \frac{1}{m} \sum_i e^{-s (M\tilde{\lambda}^{t-1})_i}$, then observe that the
scaled derivative $G'(s)/G(s)$, evaluated at the chosen scale $s_{t-1}$ and multiplied by $s_{t-1} \ge 0$,
is exactly equal to the second summation. Since $G(s) > 0$,
it suffices to show the derivative $G'(s) \le 0$ at the optimum value of $s$, denoted by $s^*$.
Since $G$ is a strictly convex function ($\forall s : G''(s) > 0$), it is either strictly increasing or
strictly decreasing throughout $[0, 1]$, or it has a local minimum. In the case when it is strictly
decreasing throughout, $G'(s) \le 0$ everywhere, whereas if $G$ has a local minimum, then
$G'(s) = 0$ at $s^*$. We finish the proof by showing that $G$ cannot be strictly increasing
throughout $[0, 1]$. If it were, we would have $L(\tilde{\lambda}^{t-1}) = G(1) > G(0) = 1$, an impossibility since
the loss decreases through rounds. ■

Lemmas 2 and 8 together now imply Theorem 7, where we used that $2/\ln 2 < 3$.

In experiments we ran, the scaling back never occurs. For such datasets, AdaBoost
and AdaBoost.S are identical. We believe that even for contrived examples, the rescaling
could happen only a few times, implying that both AdaBoost and AdaBoost.S would enjoy
the convergence rates of Theorem 7. In the next section, we construct rate lower bound
examples to show that this is nearly the best rate one can hope to show.

3.3 Lower-bounds

Here we show that the dependence of the rate in Theorem 1 on the norm $\|\lambda^*\|_1$ of the
solution achieving target accuracy is necessary for a wide class of datasets. The arguments
in this section are not tailored to AdaBoost, but hold more generally for any coordinate
descent algorithm, and can be readily generalized to any loss function $L'$ of the form $L'(\lambda) =
(1/m) \sum_i \phi\left( (M\lambda)_i \right)$, where $\phi : \mathbb{R} \to \mathbb{R}$ is any non-decreasing function. The first lemma
connects the size of a reference solution to the required number of rounds of boosting, and
shows that for a wide variety of datasets the convergence rate to a target loss can be lower
bounded by the $\ell_1$-norm of the smallest solution achieving that loss.

Lemma 9 Suppose the feature matrix $M$ corresponding to a dataset has two rows with
$\{-1,+1\}$ entries which are complements of each other, i.e., there are two examples on
which any hypothesis gets one wrong and one correct prediction. Then the number of rounds
required to achieve a target loss $L^*$ is at least $\inf\{\|\lambda\|_1 : L(\lambda) \le L^*\}/(2 \ln m)$.

Figure 2: The matrix used in Theorem 10 when $m = 5$.



Proof We first show that the two examples corresponding to the complementary rows
in $M$ both satisfy a certain margin boundedness property. Since each hypothesis predicts
oppositely on these, in any round $t$ their margins will be of equal magnitude and opposite
sign. Unless both margins lie in $[-\ln m, \ln m]$, one of them will be smaller than $-\ln m$.
But then the exponential loss $L(\lambda^t) = (1/m) \sum_j e^{-(M\lambda^t)_j}$ in that round will exceed 1, a
contradiction since the losses are non-increasing through rounds, and the loss at the start
was 1. Thus, assigning one of these examples the index $i$, we have the absolute margin
$|(M\lambda^t)_i|$ is bounded by $\ln m$ in any round $t$. Letting $M(i)$ denote the $i$th row of $M$, the
step length $\alpha_t$ in round $t$ therefore satisfies

$$|\alpha_t| = |M_{ij_t} \alpha_t| = |\langle M(i), \alpha_t e_{j_t} \rangle| = |(M\lambda^t)_i - (M\lambda^{t-1})_i| \le |(M\lambda^t)_i| + |(M\lambda^{t-1})_i| \le 2 \ln m,$$

and the statement of the lemma directly follows. ■
When the weak hypotheses are abstaining (Schapire and Singer, 1999), a hypothesis can make a
definitive prediction that the label is $-1$ or $+1$, or it can "abstain" by predicting zero.
No other levels of confidence are allowed, and the resulting feature matrix has entries in
$\{-1, 0, +1\}$. The next theorem constructs a feature matrix satisfying the properties of
Lemma 9 and where additionally the smallest size of a solution achieving $L^* + \epsilon$ loss is at
least $\Omega(2^m) \ln(1/\epsilon)$, for some fixed $L^*$ and every $\epsilon > 0$.

Theorem 10 Consider the following matrix $M$ with $m$ rows (or examples) labeled $0, \ldots, m-1$
and $m - 1$ columns labeled $1, \ldots, m - 1$ (assume $m \ge 3$). The square sub-matrix ignoring
row zero is an upper triangular matrix, with $1$'s on the diagonal, $-1$'s above the diagonal,
and $0$ below the diagonal. Therefore row 1 is $(+1, -1, -1, \ldots, -1)$. Row 0 is defined to be
just the complement of row 1. Then, for any $\epsilon > 0$, a loss of $2/m + \epsilon$ is achievable on this
dataset, but with large norms

$$\inf\{\|\lambda\|_1 : L(\lambda) \le 2/m + \epsilon\} \ge (2^{m-2} - 1) \ln(1/(3\epsilon)).$$

Therefore, by Lemma 9, the minimum number of rounds required for reaching loss at most
$2/m + \epsilon$ is at least $\left( \frac{2^{m-2} - 1}{2 \ln m} \right) \ln(1/(3\epsilon))$.

A picture of the matrix constructed in the above theorem for $m = 5$ is shown in Figure 2.
Theorem 10 shows that when $\epsilon$ is a small constant (say $\epsilon = 0.01$), and $\lambda^*$ is some vector
with loss $L^* + \epsilon/2$, AdaBoost takes at least $\Omega(2^m/\ln m)$ steps to get within $\epsilon/2$ of the loss
achieved by $\lambda^*$, that is, to within $L^* + \epsilon$ loss. Since $m$ and $\epsilon$ are independent quantities, this
shows that a polynomial dependence on the norm of the reference solution is unavoidable,
and this norm might be exponential in the number of training examples in the worst case.

Corollary 11 Consider feature matrices containing only $\{-1, 0, +1\}$ entries. If, for some
constants $c$ and $\beta$, the bound in Theorem 1 can be replaced by $O(\|\lambda^*\|_1^c \epsilon^{-\beta})$ for all such
matrices, then $c \ge 1$. Further, for such matrices, the bound poly$(1/\epsilon, \|\lambda^*\|_1)$ in Theorem 1
cannot be replaced by poly$(1/\epsilon, m, N)$.

We now prove Theorem 10.

Proof of Theorem 10 We first lower bound the norm of solutions achieving loss at most
$2/m + \epsilon$. Observe that since rows 0 and 1 are complementary, any solution's loss on just
examples 0 and 1 will add up to at least $2/m$. Therefore, to get within $2/m + \epsilon$, the margins
on examples $2, \ldots, m - 1$ should be at least $\ln((m - 2)/(m\epsilon)) \ge \ln(1/(3\epsilon))$ (for $m \ge 3$).
Now, the feature matrix is designed so that the margins due to a combination $\lambda$ satisfy the
following recursive relationships:

$$(M\lambda)_{m-1} = \lambda_{m-1},$$
$$(M\lambda)_i = \lambda_i - (\lambda_{i+1} + \ldots + \lambda_{m-1}), \quad\text{for } 1 \le i \le m - 2.$$

Therefore, the margin on example $m - 1$ being at least $\ln(1/(3\epsilon))$ implies $\lambda_{m-1} \ge \ln(1/(3\epsilon))$.
Similarly, $\lambda_{m-2} \ge \ln(1/(3\epsilon)) + \lambda_{m-1} \ge 2 \ln(1/(3\epsilon))$. Continuing this way,

$$\lambda_i \ge \ln\left(\frac{1}{3\epsilon}\right) + \lambda_{i+1} + \ldots + \lambda_{m-1} \ge \ln\left(\frac{1}{3\epsilon}\right) \left\{ 1 + 2^{(m-1)-(i+1)} + \ldots + 2^0 \right\} = \ln\left(\frac{1}{3\epsilon}\right) 2^{m-1-i},$$

for $i = m - 1, \ldots, 2$. Hence $\|\lambda\|_1 \ge \ln(1/(3\epsilon))(1 + 2 + \ldots + 2^{m-3}) = (2^{m-2} - 1) \ln(1/(3\epsilon))$.

We end by showing that a loss of at most $2/m + \epsilon$ is achievable. The above argument
implies that if $\lambda_i = 2^{m-1-i}$ for $i = 2, \ldots, m - 1$, then examples $2, \ldots, m - 1$ attain margin
exactly 1. If we choose $\lambda_1 = \lambda_2 + \ldots + \lambda_{m-1} = 2^{m-3} + \ldots + 1 = 2^{m-2} - 1$, then the recursive
relationship implies a zero margin on example 1 (and hence example 0). Therefore the
combination $\ln(1/\epsilon)(2^{m-2} - 1, 2^{m-3}, 2^{m-4}, \ldots, 1)$ achieves a loss $(2 + (m-2)\epsilon)/m \le 2/m + \epsilon$,
for any $\epsilon > 0$. ■
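
The construction can be checked directly for a small case. The sketch below builds the matrix
as described in the statement of Theorem 10 (rows labeled 0 to m-1, columns labeled 1 to m-1)
and evaluates the combination used at the end of the proof. The function names are
hypothetical, and the values `m = 5` and `eps = 0.01` are only an example.

```python
import math

def theorem10_matrix(m):
    """Rows 1..m-1: upper triangular with +1 on the diagonal, -1 above, 0 below.
    Row 0 is the complement (negation) of row 1."""
    rows = []
    for i in range(1, m):
        rows.append([0 if j < i else (1 if j == i else -1) for j in range(1, m)])
    row0 = [-v for v in rows[0]]
    return [row0] + rows

def reference_solution(m, eps):
    """ln(1/eps) * (2^(m-2) - 1, 2^(m-3), ..., 2, 1), as at the end of the proof."""
    scale = math.log(1.0 / eps)
    unscaled = [2 ** (m - 2) - 1] + [2 ** (m - 1 - i) for i in range(2, m)]
    return [scale * v for v in unscaled]

def exp_loss(M, lam):
    """L(lam) = (1/m) * sum_i exp(-(M lam)_i)."""
    return sum(math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M) / len(M)

m, eps = 5, 0.01
M = theorem10_matrix(m)
lam = reference_solution(m, eps)
loss = exp_loss(M, lam)                      # equals (2 + (m - 2) * eps) / m
norm = sum(abs(v) for v in lam)              # equals (2^(m-1) - 2) * ln(1/eps)
lower_bound = (2 ** (m - 2) - 1) * math.log(1.0 / (3 * eps))
```

For these values the loss is $(2 + 3 \cdot 0.01)/5 = 0.406$, and the norm of this particular
solution, $14 \ln 100$, is consistent with the lower bound $7 \ln(1/0.03)$.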

We finally show that if the weak hypotheses are confidence-rated with arbitrary levels of 
confidence, so that the feature matrix is allowed to have non- integral entries in [— 
then the minimum norm of a solution achieving a fixed accuracy can be arbitrarily large. 
Our constructions will satisfy the requirements of Lemma [H so that the norm lower bound 
translates into a rate lower bound. 

Theorem 12 Let v > be an arbitrary number, and let M be the (possibly) non-integral 
matrix with 4 examples and 2 weak hypotheses shown in Figure O Then for any e > 0, a 
loss of 1/2 + e is achievable on this dataset, but with large norms 

inf {||A||i : L(A) < 1/2 + e} > 2 ln(l/(2e))i/~^ 

Therefore, by Lemma\^ the number of rounds required to achieve loss at most 1/2 + e is at 
least ln(l/(2e))i/-Vln(m). 
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Figure 3: A picture of the matrix used in Theorem 1121 



Proof We first show a loss of $1/2 + \varepsilon$ is achievable. Observe that the vector $\lambda = (c, c)$,
with $c = \nu^{-1}\ln(1/(2\varepsilon))$, achieves margins $0, 0, \ln(1/(2\varepsilon)), \ln(1/(2\varepsilon))$ on examples 1, 2, 3, 4,
respectively. Therefore $\lambda$ achieves loss $1/2 + \varepsilon$. We next show a lower bound on the norm of
a solution achieving this loss. Observe that since the first two rows are complementary, the
loss due to just the first two examples is at least 1/2. Therefore, any solution $\lambda = (\lambda_1, \lambda_2)$
achieving at most $1/2 + \varepsilon$ loss overall must achieve a margin of at least $\ln(1/(2\varepsilon))$ on both
the third and fourth examples. By inspecting the two columns, this implies

$$\lambda_1 - \lambda_2 + \lambda_2\nu \ge \ln(1/(2\varepsilon)),$$
$$\lambda_2 - \lambda_1 + \lambda_1\nu \ge \ln(1/(2\varepsilon)).$$

Adding the two inequalities we find

$$\nu(\lambda_1 + \lambda_2) \ge 2\ln(1/(2\varepsilon)) \implies \lambda_1 + \lambda_2 \ge 2\nu^{-1}\ln(1/(2\varepsilon)).$$

By the triangle inequality, $\|\lambda\|_1 \ge \lambda_1 + \lambda_2$, and the theorem follows. ■
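The achievability half of the argument can be checked numerically. The sketch below is only an illustration under stated assumptions: rows 3 and 4 of the matrix are read off from the two margin inequalities in the proof, while rows 1 and 2 are an assumed complementary pair (the entries of Figure 3 are not reproduced in the text). The names `exp_loss` and `theorem12_matrix` are hypothetical helpers.

```python
import math

def exp_loss(M, lam):
    """Normalized exponential loss L(lam) = (1/m) * sum_i exp(-(M lam)_i)."""
    margins = [sum(m_ij * l_j for m_ij, l_j in zip(row, lam)) for row in M]
    return sum(math.exp(-g) for g in margins) / len(M)

def theorem12_matrix(nu):
    # Rows 1-2: an ASSUMED complementary pair, not taken from Figure 3.
    # Rows 3-4: chosen so the margins are l1 - l2 + nu*l2 and l2 - l1 + nu*l1,
    # matching the two inequalities in the proof.
    return [[+1.0, -1.0],
            [-1.0, +1.0],
            [+1.0, -1.0 + nu],
            [-1.0 + nu, +1.0]]

nu, eps = 0.01, 0.05
c = math.log(1.0 / (2.0 * eps)) / nu          # c = ln(1/(2 eps)) / nu
M = theorem12_matrix(nu)
loss = exp_loss(M, [c, c])                    # should equal 1/2 + eps
norm_lower_bound = 2.0 * math.log(1.0 / (2.0 * eps)) / nu
print(loss, 2 * c, norm_lower_bound)
```

With these assumed rows, the vector $(c, c)$ attains loss $1/2 + \varepsilon$ while its $\ell_1$-norm $2c$ meets the lower bound with equality, so shrinking $\nu$ inflates the required norm without changing the loss.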

Note that if $\nu = 0$, then the optimal solution is found in zero rounds of boosting and has
optimal loss 1. However, even the tiniest perturbation $\nu > 0$ causes the optimal loss to fall
to 1/2, and causes the number of rounds needed to converge to increase drastically. In fact,
by Theorem 12, the number of rounds required to achieve any fixed loss below 1 grows as
$\Omega(1/\nu)$, which is arbitrarily large when $\nu$ is infinitesimal. We may conclude that with
non-integral feature matrices, the dependence of the rate on the norm of a reference solution
is absolutely necessary.

Corollary 13 When using confidence-rated weak hypotheses with arbitrary confidence levels,
the bound $\mathrm{poly}(1/\varepsilon, \|\lambda^*\|_1)$ in Theorem 1 cannot be replaced by any function of purely
$m$, $N$ and $\varepsilon$ alone.

The construction in Figure 3 can be generalized to produce datasets with any number of
examples that suffer the same poor rate of convergence as the one in Theorem 12. We
discussed the smallest such construction, since we feel that it best highlights the drastic
effect non-integrality can have on the rate.

In this section we saw how the norm of the reference solution is an important parameter
for bounding the convergence rate. In the next section we investigate the optimal dependence
of the rate on the parameter $\varepsilon$ and show that $\Omega(1/\varepsilon)$ rounds are necessary in the
worst case.

4. Second convergence rate: Convergence to optimal loss

In the previous section, our rate bound depended on both the approximation parameter $\varepsilon$
and the size of the smallest solution achieving the target loss. For many datasets, the
optimal target loss $\inf_\lambda L(\lambda)$ cannot be realized by any finite solution. In such cases, if we
want to bound the number of rounds needed to achieve within $\varepsilon$ of the optimal loss, the
only way to use Theorem 1 is to first decompose the accuracy parameter $\varepsilon$ into two parts
$\varepsilon = \varepsilon_1 + \varepsilon_2$, find some finite solution $\lambda^*$ achieving within $\varepsilon_1$ of the optimal loss, and then use
the bound $\mathrm{poly}(1/\varepsilon_2, \|\lambda^*\|_1)$ to achieve at most $L(\lambda^*) + \varepsilon_2 \le \inf_\lambda L(\lambda) + \varepsilon$ loss. However,
this introduces implicit dependence on $\varepsilon$ through $\|\lambda^*\|_1$ which may not be immediately
clear. In this section, we show bounds of the form $C/\varepsilon$, where the constant $C$ depends only
on the feature matrix $M$, and not on $\varepsilon$. Additionally, we show that this dependence on $\varepsilon$
is optimal in Lemma 31 of the Appendix, where $\Omega(1/\varepsilon)$ rounds are shown to be necessary
for converging to within $\varepsilon$ of the optimal loss on a certain dataset. Finally, we note that
the lower bounds in the previous section indicate that $C$ can be $\Omega(2^m)$ in the worst case for
integer matrices (although it will typically be much smaller), and hence this bound, though
stronger than that of Theorem 1 with respect to $\varepsilon$, cannot be used to prove the conjecture
in (Schapire, 2010), since the constant is not polynomial in the number of examples $m$.



4.1 Upper Bound 

The main result of this section is the following rate upper bound. A similar approach to
solving this problem was taken independently by Telgarsky (2011).

Theorem 14 AdaBoost reaches within $\varepsilon$ of the optimal loss in at most $C/\varepsilon$ rounds, where
$C$ only depends on the feature matrix.



Our techniques build upon earlier work on the rate of convergence of AdaBoost, which has
mainly considered two particular cases. In the first case, the weak learning assumption
holds, that is, the edge in each round is at least some fixed constant. In this situation,
Freund and Schapire (1997) and Schapire and Singer (1999) show that the optimal loss is
zero, that no solution with finite size can achieve this loss, but AdaBoost achieves at most $\varepsilon$
loss within $O(\ln(1/\varepsilon))$ rounds. In the second case some finite combination of the weak
classifiers achieves the optimal loss, and Rätsch et al. (2002), using results from Luo and Tseng
(1992), show that AdaBoost achieves within $\varepsilon$ of the optimal loss again within $O(\ln(1/\varepsilon))$
rounds.

Here we consider the most general situation, where the weak learning assumption may
fail to hold, and yet no finite solution may achieve the optimal loss. The dataset used in
Lemma 31 and shown in Figure 4 exemplifies this situation. Our main technical contribution
shows that the examples in any dataset can be partitioned into a zero-loss set and a
finite-margin set, such that a certain form of the weak learning assumption holds within the
zero-loss set, while the optimal loss considering only the finite-margin set can be obtained
by some finite solution. The two partitions provide different ways of making progress in
every round, and one of the two kinds of progress will always be sufficient for us to prove
Theorem 14.

We next state our decomposition result, illustrate it with an example, and then state
several lemmas quantifying the nature of the progress we can make in each round. Using
these lemmas, we prove Theorem 14.

Lemma 15 (Decomposition Lemma) For any dataset, there exists a partition of the set of
training examples $X$ into a (possibly empty) zero-loss set $Z$ and a (possibly empty)
finite-margin set $F = X \setminus Z$ such that the following hold simultaneously:

1. For some positive constant $\gamma > 0$, there exists some vector $\eta^\dagger$ with unit $\ell_1$-norm
$\|\eta^\dagger\|_1 = 1$ that attains at least $\gamma$ margin on each example in $Z$, and exactly zero
margin on each example in $F$:

$$\forall i \in Z : (M\eta^\dagger)_i \ge \gamma, \qquad \forall i \in F : (M\eta^\dagger)_i = 0.$$

2. The optimal loss considering only examples within $F$ is achieved by some finite
combination $\eta^*$.

3. There is a constant $\mu_{\max} < \infty$, such that for any combination $\eta$ with bounded loss on
the finite-margin set, $\sum_{i \in F} e^{-(M\eta)_i} \le m$, the margin $(M\eta)_i$ for any example $i$ in $F$
lies in the bounded interval $[-\ln m, \mu_{\max}]$.

A proof is deferred to the next section. The decomposition lemma immediately implies that
the vector $\eta^* + \infty \cdot \eta^\dagger$, which denotes $(\eta^* + c\eta^\dagger)$ in the limit $c \to \infty$, is an optimal solution,
achieving zero loss on the zero-loss set, but only finite margins (and hence positive losses)
on the finite-margin set (thereby justifying the names).

Before proceeding, we give an example dataset and indicate the zero-loss set, finite-margin
set, $\eta^*$ and $\eta^\dagger$ to illustrate our definitions. Consider a dataset with three examples
$\{a, b, c\}$ and two hypotheses $\{\hbar_1, \hbar_2\}$ and the feature matrix $M$ in Figure 4. Here $+$ means
correct ($M_{ij} = +1$) and $-$ means wrong ($M_{ij} = -1$). The optimal solution is $\infty \cdot (\hbar_1 + \hbar_2)$ with
a loss of 2/3. The finite-margin set is $\{a, b\}$, the zero-loss set is $\{c\}$, $\eta^\dagger = (1/2, 1/2)$ and
$\eta^* = (0, 0)$; for this dataset these are unique. This dataset also serves as a lower-bound
example in Lemma 31, where we show that $2/(9\varepsilon)$ rounds are necessary for AdaBoost to
achieve loss at most $(2/3) + \varepsilon$.

Figure 4: A dataset requiring $\Omega(1/\varepsilon)$ rounds for convergence.

Before providing proofs, we introduce some notation. By $\|\cdot\|$ we will mean $\ell_2$-norm;
every other norm will have an appropriate subscript, such as $\|\cdot\|_1$, $\|\cdot\|_\infty$, etc. The set of all
training examples will be denoted by $X$. By $\ell^{\lambda}(i)$ we mean the exp-loss $e^{-(M\lambda)_i}$ on example
$i$. For any subset $S \subseteq X$ of examples, $\ell^{\lambda}(S) = \sum_{i \in S} \ell^{\lambda}(i)$ denotes the total exp-loss on
the set $S$. Notice $L(\lambda) = (1/m)\ell^{\lambda}(X)$, and that $D_{t+1}(i) = \ell^{\lambda^t}(i)/\ell^{\lambda^t}(X)$, where $\lambda^t$ is the
combination found by AdaBoost at the end of round $t$. By $\delta_S(\eta; \lambda)$ we mean the edge
obtained on the set $S$ by the vector $\eta$, when the weights over the examples are given by
$\ell^{\lambda}(\cdot)/\ell^{\lambda}(S)$:

$$\delta_S(\eta; \lambda) = \left| \frac{1}{\ell^{\lambda}(S)} \sum_{i \in S} \ell^{\lambda}(i)\,(M\eta)_i \right|.$$
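The three-example dataset and the notation above can be made concrete with a short sketch. The matrix below is an assumption: it is one sign pattern consistent with the stated facts (finite-margin set $\{a, b\}$, zero-loss set $\{c\}$, optimal loss 2/3 along $\hbar_1 + \hbar_2$); the entries of Figure 4 itself are not reproduced in the text. The helper names are hypothetical.

```python
import math

# ASSUMED encoding of the dataset: rows are examples a, b, c; columns are h1, h2.
M = [[+1, -1],   # a: h1 correct, h2 wrong
     [-1, +1],   # b: h1 wrong, h2 correct
     [+1, +1]]   # c: both correct

def example_losses(M, lam):
    """Per-example exp-loss: ell^lam(i) = exp(-(M lam)_i)."""
    return [math.exp(-sum(m_ij * l_j for m_ij, l_j in zip(row, lam))) for row in M]

def normalized_loss(M, lam):
    """L(lam) = (1/m) * ell^lam(X)."""
    return sum(example_losses(M, lam)) / len(M)

eta_dagger = [0.5, 0.5]   # unit l1-norm; zero margin on a and b, margin 1 on c
for c in (0.0, 1.0, 5.0, 20.0):
    lam = [c * e for e in eta_dagger]
    # The loss equals (2 + exp(-c)) / 3, approaching 2/3 as c grows.
    print(c, normalized_loss(M, lam))
```

Scaling $\eta^\dagger$ drives the loss on $c$ to zero while the losses on $a$ and $b$ stay at 1 each, which is why no finite vector attains the infimum 2/3.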
In the rest of the section, by "loss" we mean the unnormalized loss $\ell^{\lambda}(X) = mL(\lambda)$ and
show that in $C/\varepsilon$ rounds AdaBoost converges to within $\varepsilon$ of the optimal unnormalized
loss $\inf_\lambda \ell^{\lambda}(X)$, henceforth denoted by $K$. Note that this means AdaBoost takes $C/\varepsilon$
rounds to converge to within $\varepsilon/m$ of the optimal normalized loss, that is to loss at most
$\inf_\lambda L(\lambda) + \varepsilon/m$. Replacing $\varepsilon$ by $m\varepsilon$, it takes $C/(m\varepsilon)$ steps to attain normalized loss at most
$\inf_\lambda L(\lambda) + \varepsilon$. Thus, whether we use normalized or unnormalized loss does not substantively
affect the result in Theorem 14. The progress due to the zero-loss set is now immediate
from Item 1 of the decomposition lemma:

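Lemma 16 In any round $t$, the maximum edge $\delta_t$ is at least $\gamma\,\ell^{\lambda^{t-1}}(Z)/\ell^{\lambda^{t-1}}(X)$, where $\gamma$
is as in Item 1 of the decomposition lemma.

Proof Recall the distribution $D_t$ created by AdaBoost in round $t$ puts weight $D_t(i) =
\ell^{\lambda^{t-1}}(i)/\ell^{\lambda^{t-1}}(X)$ on each example $i$. From Item 1 we get

$$\delta_X(\eta^\dagger; \lambda^{t-1}) = \left| \frac{1}{\ell^{\lambda^{t-1}}(X)} \sum_{i \in X} \ell^{\lambda^{t-1}}(i)\,(M\eta^\dagger)_i \right| \ge \frac{1}{\ell^{\lambda^{t-1}}(X)} \sum_{i \in Z} \gamma\,\ell^{\lambda^{t-1}}(i) = \gamma\,\frac{\ell^{\lambda^{t-1}}(Z)}{\ell^{\lambda^{t-1}}(X)}.$$

Since $(M\eta^\dagger)_i = \sum_j \eta^\dagger_j (M e_j)_i$, we may rewrite the edge $\delta_X(\eta^\dagger; \lambda^{t-1})$ as follows:

$$\delta_X(\eta^\dagger; \lambda^{t-1}) = \left| \sum_j \eta^\dagger_j\, \frac{1}{\ell^{\lambda^{t-1}}(X)} \sum_{i \in X} \ell^{\lambda^{t-1}}(i)\,(M e_j)_i \right| \le \sum_j |\eta^\dagger_j|\, \delta_X(e_j; \lambda^{t-1}).$$

Since the $\ell_1$-norm of $\eta^\dagger$ is 1, the weights $|\eta^\dagger_j|$ form some distribution $p$ over the columns
$1, \ldots, N$. We may therefore conclude

$$\gamma\,\frac{\ell^{\lambda^{t-1}}(Z)}{\ell^{\lambda^{t-1}}(X)} \le \delta_X(\eta^\dagger; \lambda^{t-1}) \le \mathbb{E}_{j \sim p}\left[\delta_X(e_j; \lambda^{t-1})\right] \le \max_j \delta_X(e_j; \lambda^{t-1}) \le \delta_t. \;■$$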
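The bound of Lemma 16 can be observed along an actual run. The following is a minimal sketch, assuming the three-example $\pm 1$ matrix used to illustrate the decomposition lemma (an assumed encoding of Figure 4) with $Z = \{c\}$ and $\gamma = 1$, and assuming the usual exact line-search step $\frac{1}{2}\ln\frac{1+\delta}{1-\delta}$ for $\pm 1$ entries. The function names are hypothetical.

```python
import math

M = [[+1, -1], [-1, +1], [+1, +1]]   # ASSUMED rows a, b, c (see Figure 4 discussion)
Z = [2]                              # zero-loss set {c}
gamma = 1.0                          # margin of eta_dagger = (1/2, 1/2) on c

def example_losses(lam):
    return [math.exp(-sum(m_ij * l_j for m_ij, l_j in zip(row, lam))) for row in M]

def signed_edges(lam):
    """Signed edge of each coordinate direction under weights ell(i)/ell(X)."""
    ell = example_losses(lam)
    total = sum(ell)
    return [sum(ell[i] * M[i][j] for i in range(len(M))) / total
            for j in range(len(M[0]))]

lam = [0.0, 0.0]
history = []                         # (delta_t, gamma * ell(Z) / ell(X)) per round
for t in range(1, 51):
    ell = example_losses(lam)
    edges = signed_edges(lam)
    j = max(range(len(edges)), key=lambda k: abs(edges[k]))
    delta_t = abs(edges[j])
    bound = gamma * sum(ell[i] for i in Z) / sum(ell)
    history.append((delta_t, bound))
    # Exact coordinate step for +/-1 entries; the sign follows the signed edge.
    lam[j] += 0.5 * math.log((1 + edges[j]) / (1 - edges[j]))

final_loss = sum(example_losses(lam)) / len(M)
print(history[0], history[-1], final_loss)
```

In every round the maximum edge stays at or above $\gamma\,\ell(Z)/\ell(X)$, and both quantities shrink together as the loss approaches 2/3.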



If the set $F$ were empty, then Lemma 16 implies an edge of $\gamma$ is available in each round.
This in fact means that the weak learning assumption holds, and using the per-round bound
$L(\lambda^t) \le L(\lambda^{t-1})\sqrt{1 - \delta_t^2}$, we can show an $O(\ln(1/\varepsilon)\gamma^{-2})$ bound matching the rate
bounds of Freund and Schapire (1997) and Schapire and Singer (1999). So henceforth, we
assume that $F$ is non-empty. Note that this implies that the optimal loss $K$ is at least 1
(since any solution will get non-positive margin on some example in $F$), a fact we will use
later in the proofs.

Lemma 16 says that the edge is large if the loss on the zero-loss set is large. On the
other hand, when it is small, Lemmas 17 and 18 together show how AdaBoost can make
good progress using the finite-margin set. Lemma 17 uses second order methods to show
how progress is made in the case where there is a finite solution. Similar arguments, under
additional assumptions, have earlier appeared in (Rätsch et al., 2002).

Lemma 17 Suppose $\lambda$ is a combination such that $m \ge \ell^{\lambda}(F) \ge K$. Then in some
coordinate direction the edge is at least $\sqrt{C_0\,(\ell^{\lambda}(F) - K)/\ell^{\lambda}(F)}$, where $C_0$ is a constant
depending only on the feature matrix $M$.

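Proof Let $M_F \in \mathbb{R}^{|F| \times N}$ be the matrix $M$ restricted to only the rows corresponding to the
examples in $F$. Choose $\eta$ such that $\lambda + \eta = \eta^*$ is an optimal solution over $F$. Without loss
of generality assume that $\eta$ lies in the orthogonal subspace of the null-space $\{u : M_F u = 0\}$
of $M_F$ (since we can translate $\eta^*$ along the null space if necessary for this to hold). If $\eta = 0$,
then $\ell^{\lambda}(F) = K$ and we are done. Otherwise $\|M_F \eta\| \ge \lambda_{\min}\|\eta\|$, where $\lambda_{\min}^2$ is the smallest
positive eigenvalue of the symmetric matrix $M_F^T M_F$ (exists since $M_F \eta \ne 0$). Now define
$f : [0, 1] \to \mathbb{R}$ as the loss along the (rescaled) segment $[\eta^*, \lambda]$:

$$f(x) = \ell^{(\eta^* - x\eta)}(F) = \sum_{i \in F} \ell^{\eta^*}(i)\, e^{x (M_F \eta)_i}.$$

This implies that $f(0) = K$ and $f(1) = \ell^{\lambda}(F)$. Notice that the first and second derivatives
of $f(x)$ are given by:

$$f'(x) = \sum_{i \in F} (M_F \eta)_i\, \ell^{(\eta^* - x\eta)}(i), \qquad f''(x) = \sum_{i \in F} (M_F \eta)_i^2\, \ell^{(\eta^* - x\eta)}(i).$$

We next lower bound possible values of the second derivative as follows:

$$f''(x) = \sum_{i' \in F} (M_F \eta)_{i'}^2\, \ell^{(\eta^* - x\eta)}(i') \ge \sum_{i' \in F} (M_F \eta)_{i'}^2 \min_{i} \ell^{(\eta^* - x\eta)}(i) \ge \|M_F \eta\|^2 \min_{i} \ell^{(\eta^* - x\eta)}(i).$$

Since both $\lambda = \eta^* - \eta$ and $\eta^*$ suffer total loss at most $m$, by convexity, so does $\eta^* - x\eta$
for any $x \in [0, 1]$. Hence we may apply Item 3 of the decomposition lemma to the vector
$\eta^* - x\eta$, for any $x \in [0, 1]$, to conclude that $\ell^{(\eta^* - x\eta)}(i) = \exp\{-(M_F(\eta^* - x\eta))_i\} \ge e^{-\mu_{\max}}$
on every example $i$. Therefore we have

$$f''(x) \ge \|M_F \eta\|^2 e^{-\mu_{\max}} \ge \lambda_{\min}^2 e^{-\mu_{\max}} \|\eta\|^2 \quad \text{(by choice of } \eta\text{)}.$$

A standard second-order result is (see e.g. Boyd and Vandenberghe, 2004, eqn. (9.9))

$$|f'(1)|^2 \ge 2 \left( \inf_{x \in [0,1]} f''(x) \right) \left( f(1) - f(0) \right).$$

Collecting our results so far, we get

$$\left| \sum_{i \in F} \ell^{\lambda}(i)\,(M\eta)_i \right| = |f'(1)| \ge \|\eta\| \sqrt{2 \lambda_{\min}^2 e^{-\mu_{\max}} \left( \ell^{\lambda}(F) - K \right)}.$$

Next let $\tilde{\eta} = \eta/\|\eta\|_1$ be $\eta$ rescaled to have unit $\ell_1$ norm. Then we have

$$\left| \sum_{i \in F} \ell^{\lambda}(i)\,(M\tilde{\eta})_i \right| = \frac{1}{\|\eta\|_1} \left| \sum_{i \in F} \ell^{\lambda}(i)\,(M\eta)_i \right| \ge \frac{\|\eta\|}{\|\eta\|_1} \sqrt{2 \lambda_{\min}^2 e^{-\mu_{\max}} \left( \ell^{\lambda}(F) - K \right)}.$$

Applying the Cauchy-Schwarz inequality, we may lower bound $\|\eta\|/\|\eta\|_1$ by $1/\sqrt{N}$ (since
$\eta \in \mathbb{R}^N$). Along with the fact $\ell^{\lambda}(F) \le m$, we may write

$$\frac{1}{\ell^{\lambda}(F)} \left| \sum_{i \in F} \ell^{\lambda}(i)\,(M\tilde{\eta})_i \right| \ge \sqrt{2 \lambda_{\min}^2 N^{-1} m^{-1} e^{-\mu_{\max}}}\, \sqrt{\left( \ell^{\lambda}(F) - K \right)/\ell^{\lambda}(F)}.$$

If we define $p$ to be a distribution on the columns $\{1, \ldots, N\}$ of $M_F$ which puts probability
$p(j)$ proportional to $|\tilde{\eta}_j|$ on column $j$, then we have

$$\frac{1}{\ell^{\lambda}(F)} \left| \sum_{i \in F} \ell^{\lambda}(i)\,(M\tilde{\eta})_i \right| \le \mathbb{E}_{j \sim p} \left| \frac{1}{\ell^{\lambda}(F)} \sum_{i \in F} \ell^{\lambda}(i)\,(M e_j)_i \right| \le \max_j \left| \frac{1}{\ell^{\lambda}(F)} \sum_{i \in F} \ell^{\lambda}(i)\,(M e_j)_i \right|.$$

Notice the quantity inside the max is precisely the edge $\delta_F(e_j; \lambda)$ in direction $j$. Combining
everything, the maximum possible edge is

$$\max_j \delta_F(e_j; \lambda) \ge \sqrt{C_0 \left( \ell^{\lambda}(F) - K \right)/\ell^{\lambda}(F)},$$

where we define $C_0 = 2 \lambda_{\min}^2 N^{-1} m^{-1} e^{-\mu_{\max}}$. ■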



Lemma 18 Suppose, at some stage of boosting, the combination found by AdaBoost is $\lambda$,
and the loss is $K + \theta$. Let $\Delta\theta$ denote the drop in the suboptimality $\theta$ after one more round;
i.e., the loss after one more round is $K + \theta - \Delta\theta$. Then there are constants $C_1, C_2$ depending
only on the feature matrix (and not on $\theta$), such that if $\ell^{\lambda}(Z) < C_1\theta$, then $\Delta\theta \ge C_2\theta$.

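Proof Let $\lambda$ be the current solution found by boosting. Using Lemma 17, pick a direction $j$
in which the edge $\delta_F(e_j; \lambda)$ restricted to the finite loss set is at least $\sqrt{2C_0(\ell^{\lambda}(F) - K)/\ell^{\lambda}(F)}$.
We can bound the edge $\delta_X(e_j; \lambda)$ on the entire set of examples as follows:

$$\delta_X(e_j; \lambda) = \frac{1}{\ell^{\lambda}(X)} \left| \sum_{i \in F} \ell^{\lambda}(i)\,(M e_j)_i + \sum_{i \in Z} \ell^{\lambda}(i)\,(M e_j)_i \right|$$
$$\ge \frac{1}{\ell^{\lambda}(X)} \left( \ell^{\lambda}(F)\,\delta_F(e_j; \lambda) - \sum_{i \in Z} \ell^{\lambda}(i) \right) \quad \text{(using the triangle inequality)}$$
$$\ge \frac{1}{\ell^{\lambda}(X)} \left( \sqrt{2C_0(\ell^{\lambda}(F) - K)\,\ell^{\lambda}(F)} - \ell^{\lambda}(Z) \right).$$

Now, $\ell^{\lambda}(Z) < C_1\theta$, and $\ell^{\lambda}(F) - K = \theta - \ell^{\lambda}(Z) \ge (1 - C_1)\theta$. Further, we will choose $C_1 \le 1$,
so that $\ell^{\lambda}(F) \ge K \ge 1$. Hence, the previous inequality implies

$$\delta_X(e_j; \lambda) \ge \frac{1}{\ell^{\lambda}(X)} \left( \sqrt{2C_0(1 - C_1)\theta} - C_1\theta \right).$$

Set $C_1 = \min\left\{1/2, (1/4)\sqrt{C_0/(2m)}\right\}$. Using $\theta \le K + \theta = \ell^{\lambda}(X) \le m$, we can bound the
square of the term in brackets on the previous line as

$$\left( \sqrt{2C_0(1 - C_1)\theta} - C_1\theta \right)^2 \ge 2C_0(1 - C_1)\theta - 2C_1\theta\sqrt{2C_0(1 - C_1)\theta}$$
$$\ge 2C_0(1 - 1/2)\theta - 2\left( (1/4)\sqrt{C_0/(2m)} \right)\theta\sqrt{2C_0(1 - 0)m}$$
$$= C_0\theta/2.$$

So, if $\delta$ is the maximum edge in any direction, then

$$\delta \ge \delta_X(e_j; \lambda) \ge \sqrt{C_0\theta/(2(K + \theta)^2)} \ge \sqrt{C_0\theta/(2m(K + \theta))},$$

where, for the last inequality, we again used $K + \theta \le m$. Therefore the loss after one more
step is at most $(K + \theta)\sqrt{1 - \delta^2} \le (K + \theta)(1 - \delta^2/2) \le K + \theta - \frac{C_0}{4m}\theta$. Setting $C_2 = C_0/(4m)$
completes the proof. ■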



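Proof of Theorem 14 At any stage of boosting, let $\lambda$ be the current combination, and $K + \theta$
be the current loss. We show that the new loss is at most $K + \theta - \Delta\theta$ for $\Delta\theta \ge C_3\theta^2$ for some
constant $C_3$ depending only on the dataset (and not $\theta$). To see this, either $\ell^{\lambda}(Z) < C_1\theta$,
in which case Lemma 18 applies, and $\Delta\theta \ge C_2\theta \ge (C_2/m)\theta^2$ (since $\theta = \ell^{\lambda}(X) - K \le m$).
Or $\ell^{\lambda}(Z) \ge C_1\theta$, in which case applying Lemma 16 yields $\delta \ge \gamma C_1\theta/\ell^{\lambda}(X) \ge (\gamma C_1/m)\theta$.
Since the loss after one round is at most $\ell^{\lambda}(X)\sqrt{1 - \delta^2}$, we get
$\Delta\theta \ge \ell^{\lambda}(X)(1 - \sqrt{1 - \delta^2}) \ge \ell^{\lambda}(X)\delta^2/2 \ge (K/2)(\gamma C_1/m)^2\theta^2$. Using $K \ge 1$ and
choosing $C_3$ appropriately gives the required condition.

If $K + \theta_t$ denotes the loss in round $t$, then the above claim implies $\theta_t - \theta_{t+1} \ge C_3\theta_t^2$.
Applying Lemma 32 to the sequence $\{\theta_t\}$ we have $1/\theta_T - 1/\theta_0 \ge C_3 T$ for any $T$. Since
$\theta_0 \ge 0$, we have $T \le 1/(C_3\theta_T)$. Hence to achieve loss $K + \varepsilon$, $C_3^{-1}/\varepsilon$ rounds suffice. ■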
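The last step, turning a per-round drop of $C_3\theta_t^2$ into a $1/(C_3 T)$ rate, can be illustrated with a small simulation. This is a sketch under assumptions: it iterates the slowest decrease the claim permits, $\theta_{t+1} = \theta_t - C_3\theta_t^2$, with illustrative values $\theta_0 = 1$ and $C_3 = 0.1$ chosen so that $C_3\theta_0 < 1$ and the sequence stays positive. The function name is hypothetical.

```python
def worst_case_gaps(theta0, c3, rounds):
    """Iterate theta_{t+1} = theta_t - c3 * theta_t**2, the slowest allowed decrease."""
    thetas = [theta0]
    for _ in range(rounds):
        theta = thetas[-1]
        thetas.append(theta - c3 * theta * theta)
    return thetas

theta0, c3, T = 1.0, 0.1, 1000     # illustrative values, requires c3 * theta0 < 1
thetas = worst_case_gaps(theta0, c3, T)

# Each round increases 1/theta by at least c3, so 1/theta_T - 1/theta_0 >= c3 * T.
inverse_growth = 1.0 / thetas[-1] - 1.0 / thetas[0]
print(thetas[-1], inverse_growth, c3 * T)
```

Even in this slowest case the suboptimality after $T$ rounds is below $1/(C_3 T)$, which is the $C/\varepsilon$ behaviour claimed in Theorem 14 with $C = 1/C_3$.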

4.2 Proof of the decomposition lemma 

Throughout this section we only consider (unless otherwise stated) admissible combinations
$\lambda$ of weak classifiers, which have loss $\ell^{\lambda}(X)$ bounded by $m$ (since such are the ones found
by boosting). We prove Lemma 15 in three steps. We begin with a simple lemma that
rigorously defines the zero-loss and finite-margin sets.

Lemma 19 For any sequence $\eta_1, \eta_2, \ldots$ of admissible combinations of weak classifiers,
we can find a subsequence $\eta_{(1)} = \eta_{t_1}, \eta_{(2)} = \eta_{t_2}, \ldots$ whose losses converge to zero on all
examples in some fixed (possibly empty) subset $Z$ (the zero-loss set), and losses bounded
away from zero in its complement $X \setminus Z$ (the finite-margin set):

$$\forall x \in Z : \lim_{t \to \infty} \ell^{\eta_{(t)}}(x) = 0, \qquad \forall x \in X \setminus Z : \inf_t \ell^{\eta_{(t)}}(x) > 0. \qquad (8)$$

Proof We will build a zero-loss set and the final subsequence incrementally. Initially the 
set is empty. Pick the first example. If the infimal loss ever attained on the example in the 
sequence is bounded away from zero, then we do not add it to the set. Otherwise we add 
it, and consider only the subsequence whose $t$-th element attains loss less than $1/t$ on the
example. Beginning with this subsequence, we now repeat with other examples. The final 
sequence is the required subsequence, and the examples we have added form the zero-loss 
set. ■ 

We apply Lemma 19 to some admissible sequence converging to the optimal loss (for
instance, the one found by AdaBoost). Let us call the resulting subsequence $\eta_{(t)}$, the obtained
zero-loss set $Z$, and the finite-margin set $F = X \setminus Z$. The next lemma shows how to extract
a single combination out of the sequence $\eta_{(t)}$ that satisfies the properties in Item 1 of the
decomposition lemma.






MUKHERJEE, RUDIN AND SCHAPIRE 



Lemma 20 Suppose $M$ is the feature matrix, $Z$ is a subset of the examples, and $\eta_{(1)}, \eta_{(2)}, \ldots$
is a sequence of combinations of weak classifiers such that $Z$ is its zero loss set, and $X \setminus Z$
its finite loss set, that is, (8) holds. Then there is a combination $\eta^\dagger$ of weak classifiers that
achieves positive margin on every example in $Z$, and zero margin on every example in its
complement $X \setminus Z$, that is:

$$(M\eta^\dagger)_i > 0 \ \text{ if } i \in Z, \qquad (M\eta^\dagger)_i = 0 \ \text{ if } i \in X \setminus Z.$$



Proof Since the $\eta_{(t)}$ achieve arbitrarily large positive margins on $Z$, $\|\eta_{(t)}\|$ will be
unbounded, and it will be hard to extract a useful single solution out of them. On the other
hand, the rescaled combinations $\eta_{(t)}/\|\eta_{(t)}\|$ lie on a compact set, and therefore have a limit
point, which might have useful properties. We formalize this next.

We prove the statement of the lemma by induction on the total number of training
examples $|X|$. If $X$ is empty, then the lemma holds vacuously for any $\eta^\dagger$. Assume inductively
for all $X$ of size less than $m > 0$, and consider $X$ of size $m$. Since translating a vector along
the null space of $M$, $\ker M = \{x : Mx = 0\}$, has no effect on the margins produced by the
vector, assume without loss of generality that the $\eta_{(t)}$'s are orthogonal to $\ker M$. Also, since
the margins produced on the zero loss set are unbounded, so are the norms of $\eta_{(t)}$. Therefore
assume (by picking a subsequence and relabeling if necessary) that $\|\eta_{(t)}\| > t$. Let $\eta'$ be a
limit point of the sequence $\eta_{(t)}/\|\eta_{(t)}\|$, a unit vector that is also orthogonal to the null-space.
Then firstly $\eta'$ achieves non-negative margin on every example; otherwise by continuity
for some extremely large $t$, the margin of $\eta_{(t)}/\|\eta_{(t)}\|$ on that example is also negative
and bounded away from zero, and therefore $\eta_{(t)}$'s loss is more than $m$, a contradiction to
admissibility. Secondly, the margin of $\eta'$ on each example in $X \setminus Z$ is zero; otherwise, by
continuity, for arbitrarily large $t$ the margin of $\eta_{(t)}/\|\eta_{(t)}\|$ on an example in $X \setminus Z$ is positive
and bounded away from zero, and hence that example attains arbitrarily small loss in the
sequence, a contradiction to (8). Finally, if $\eta'$ achieves zero margin everywhere in $Z$, then
$\eta'$, being orthogonal to the null-space, must be 0, a contradiction since $\eta'$ is a unit vector.
Therefore $\eta'$ must achieve positive margin on some non-empty subset $S$ of $Z$, and zero
margins on every other example.

Next we use induction on the reduced set of examples $X' = X \setminus S$. Since $S$ is non-empty,
$|X'| < m$. Further, using the same sequence $\eta_{(t)}$, the zero-loss and finite-loss sets, restricted
to $X'$, are $Z' = Z \setminus S$ and $(X \setminus Z) \setminus S = X \setminus Z$ (since $S \subseteq Z$) $= X' \setminus Z'$. By the inductive
hypothesis, there exists some $\eta''$ which achieves positive margins on $Z'$, and zero margins
on $X' \setminus Z' = X \setminus Z$. Therefore, by setting $\eta^\dagger = c\eta' + \eta''$ for a large enough $c$ (large enough
that $c\eta'$ dominates any negative margin $\eta''$ may produce on $S$), we can achieve
the desired properties. ■

Applying Lemma 20 to the sequence $\eta_{(t)}$ yields some convex combination $\eta^\dagger$ having margin
at least $\gamma > 0$ (for some $\gamma$) on $Z$ and zero margin on its complement, proving Item 1 of the
decomposition lemma. The next lemma proves Item 2.

Lemma 21 The optimal loss considering only examples within $F$ is achieved by some finite
combination $\eta^*$.

Proof The existence of $\eta^\dagger$ with properties as in Lemma 20 implies that the optimal loss is
the same whether considering all the examples, or just examples in $F$. Therefore it suffices
to show the existence of finite $\eta^*$ that achieves loss $K$ on $F$, that is, $\ell^{\eta^*}(F) = K$.

Recall $M_F$ denotes the matrix $M$ restricted to the rows corresponding to examples in
$F$. Let $\ker M_F = \{x : M_F x = 0\}$ be the null-space of $M_F$. Let $\eta^{(t)}$ be the projection of
$\eta_{(t)}$ onto the orthogonal subspace of $\ker M_F$. Then the losses $\ell^{\eta^{(t)}}(F) = \ell^{\eta_{(t)}}(F)$ converge
to the optimal loss $K$. If $M_F$ is identically zero, then each $\eta^{(t)} = 0$, and then $\eta^* = 0$
has loss $K$ on $F$. Otherwise, let $\lambda^2$ be the smallest positive eigenvalue of $M_F^T M_F$.
Then $\|M_F\eta^{(t)}\| \ge \lambda\|\eta^{(t)}\|$. By the definition of finite margin set,
$\inf_{t} \min_{i \in F} \ell^{\eta^{(t)}}(i) = \inf_{t} \min_{i \in F} \ell^{\eta_{(t)}}(i) > 0$. Therefore, the norms of the margin vectors
$\|M_F\eta^{(t)}\|$, and hence those of $\eta^{(t)}$, are bounded. Therefore the $\eta^{(t)}$'s have a (finite) limit
point $\eta^*$ that must have loss $K$ over $F$. ■

As a corollary, we prove Item 3.

Lemma 22 There is a constant $\mu_{\max} < \infty$, such that for any combination $\eta$ that achieves
bounded loss on the finite-margin set, $\ell^{\eta}(F) \le m$, the margin $(M\eta)_i$ for any example $i$ in
$F$ lies in the bounded interval $[-\ln m, \mu_{\max}]$.

Proof Since the loss $\ell^{\eta}(F)$ is at most $m$, no margin may be less than $-\ln m$. To
prove a finite upper bound on the margins, we argue by contradiction. Suppose arbitrarily
large margins are producible by bounded loss vectors, that is, arbitrarily large elements are
present in the set $\{(M\eta)_i : \ell^{\eta}(F) \le m, 1 \le i \le m\}$. Then for some fixed example $x \in F$
there exists a sequence of combinations of weak classifiers, whose $t$-th element achieves more
than margin $t$ on $x$ but has loss at most $m$ on $F$. Applying Lemma 19 we can find a
subsequence $\lambda^{(t)}$ whose tail achieves vanishingly small loss on some non-empty subset $S$ of
$F$ containing $x$, and bounded margins in $F \setminus S$. Applying Lemma 20 to $\lambda^{(t)}$ we get some
convex combination $\lambda^\dagger$ which has positive margins on $S$ and zero margin on $F \setminus S$. Let $\eta^*$
be as in Lemma 21, a finite combination achieving the optimal loss on $F$. Then $\eta^* + \infty \cdot \lambda^\dagger$
achieves the same loss on every example in $F \setminus S$ as the optimal solution $\eta^*$, but zero loss
for examples in $S$. This solution is strictly better than $\eta^*$ on $F$, a contradiction to the
optimality of $\eta^*$. Therefore our assumption is false, and some finite upper bound $\mu_{\max}$ on
the margins $(M\eta)_i$ of vectors satisfying $\ell^{\eta}(F) \le m$ exists. ■



4.3 Investigating the constants 

In this section, we try to estimate the constant $C$ in Theorem 14. We show that it can be
arbitrarily large for adversarial feature matrices with real entries (corresponding to
confidence-rated weak hypotheses), but has an upper bound doubly exponential in the number
of examples when the feature matrix has $\{-1, 0, +1\}$ entries only. We also show that this
doubly exponential bound cannot be improved without significantly changing the proof in
the previous section.

By inspecting the proofs, we can bound the constant in Theorem 14 as follows.

Corollary 23 The constant $C$ in Theorem 14 that emerges from the proofs is

where $m$ is the number of examples, $N$ is the number of hypotheses, $\gamma$ and $\mu_{\max}$ are as given
by Items 1 and 3 of the decomposition lemma, and $\lambda_{\min}^2$ is the smallest positive eigenvalue
of $M_F^T M_F$ ($M_F$ is the feature matrix restricted to the rows belonging to the finite margin
set $F$).

Our bound on $C$ will be obtained by in turn bounding the quantities $\lambda_{\min}^{-1}$, $\gamma^{-1}$, $\mu_{\max}$. These
are strongly related to the singular values of the feature matrix $M$, and in general cannot
be easily measured. In fact, when $M$ has real entries, we have already seen in Section 3.3
that the rate can be arbitrarily large, implying these parameters can have very large values.
Even when the matrix $M$ has integer entries (that is, $-1, 0, +1$), the next lemma shows
that these quantities can be exponential in the number of examples.

Lemma 24 There are examples of feature matrices with $-1, 0, +1$ entries and at most $m$
rows or columns (where $m > 10$) for which the quantities $\gamma^{-1}$, $\lambda^{-1}$ and $\mu_{\max}$ are at least
$\Omega(2^m/m)$.

Proof We first show the bounds for $\gamma$ and $\lambda$. Let $M$ be an $m \times m$ upper triangular
matrix with $+1$ on the diagonal, and $-1$ above the diagonal. Let $y = (2^{m-1}, 2^{m-2}, \ldots, 1)^T$,
and $b = (1, 1, \ldots, 1)^T$. Then $My = b$, although $y$ has much bigger norm than $b$:
$\|y\| \ge 2^{m-1}$, while $\|b\| = \sqrt{m} \le m$. Since $M$ is invertible, by the definition of $\lambda_{\min}$, we have
$\|My\| \ge \lambda_{\min}\|y\|$, so that $\lambda_{\min}^{-1} \ge \|y\|/\|My\| \ge 2^{m-1}/m$. Next, note that $y$ produces all
positive margins $b$, and hence the zero-loss set consists of all the examples. In particular,
if $\eta^\dagger$ is as in Item 1 of the decomposition lemma, then the vector $\gamma^{-1}\eta^\dagger$ achieves at least
margin 1 on each example: $M(\gamma^{-1}\eta^\dagger) \ge b$. On the other hand, our matrix is very
similar to the one in Lemma 10, and the same arguments in the proof of that lemma can
be used to show that if for some $x$ we have $Mx \ge b$, then $x \ge y$. This implies that
$\gamma^{-1}\|\eta^\dagger\|_1 \ge \|y\|_1 = 2^m - 1$. Since $\eta^\dagger$ has unit $\ell_1$-norm, the bound on $\gamma^{-1}$ follows too.

Next we provide an example showing $\mu_{\max}$ can be $\Omega(2^m/m)$. Consider an $m \times (m-1)$
matrix $M$. The bottom row of $M$ is all $+1$. The upper $(m-1) \times (m-1)$ submatrix
of $M$ is a lower triangular matrix with $-1$ on the diagonal and $+1$ below the diagonal.
Observe that if $y^T = (2^{m-2}, 2^{m-3}, \ldots, 1, 1)$, then $y^T M = 0$. Therefore, for any vector
$x$, the inner product of the margins $Mx$ with $y$ is zero: $y^T M x = 0$. This implies that
achieving positive margin on any example forces some other example to receive negative
margin. By Item 1 of the decomposition lemma, the zero loss set in this dataset is empty,
and all the examples belong to the finite loss set. Next, we choose a combination with at
most $m$ loss that nevertheless achieves $\Omega(2^m/m)$ positive margin on some example. Let
$x^T = (1, 2, 4, \ldots, 2^{m-2})$. Then $(Mx)^T = (-1, -1, \ldots, -1, 2^{m-1} - 1)$. Then the margins
using $\varepsilon x$ are $(-\varepsilon, \ldots, -\varepsilon, \varepsilon(2^{m-1} - 1))$ with total loss $(m-1)e^{\varepsilon} + e^{-\varepsilon(2^{m-1} - 1)}$. Choose
$\varepsilon = 1/(2m) \le 1$, so that the loss on each example corresponding to the first $m - 1$ rows is at
most $e^{\varepsilon} \le 1 + 2\varepsilon = 1 + 1/m$, where the first inequality holds since $\varepsilon \in [0, 1]$. For $m > 10$,
the choice of $\varepsilon$ guarantees $1/(2m) = \varepsilon \ge (\ln m)/(2^{m-1} - 1)$, so that the loss on the example
corresponding to the bottom-most row is $e^{-\varepsilon(2^{m-1} - 1)} \le e^{-\ln m} = 1/m$. Therefore the net
loss of $\varepsilon x$ is at most $(m-1)(1 + 1/m) + 1/m = m$. On the other hand the margin on the
example corresponding to the last row is $\varepsilon(2^{m-1} - 1) = (2^{m-1} - 1)/(2m) = \Omega(2^m/m)$. ■
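Both constructions in this proof are easy to check by direct computation. The sketch below is an illustration only: it builds the two matrices as described, uses $m = 12$ as an arbitrary choice satisfying $m > 10$, and the helper names (`upper_triangular_matrix`, `second_matrix`, `matvec`) are hypothetical.

```python
import math

def upper_triangular_matrix(m):
    """m x m matrix with +1 on the diagonal, -1 above it, 0 below."""
    return [[1 if j == i else (-1 if j > i else 0) for j in range(m)]
            for i in range(m)]

def second_matrix(m):
    """m x (m-1): top block lower triangular (-1 diagonal, +1 below), bottom row all +1."""
    top = [[-1 if j == i else (1 if j < i else 0) for j in range(m - 1)]
           for i in range(m - 1)]
    return top + [[1] * (m - 1)]

def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

m = 12
# First construction: M1 y = b with y exponentially larger than b.
M1 = upper_triangular_matrix(m)
y = [2 ** (m - 1 - i) for i in range(m)]              # (2^{m-1}, ..., 2, 1)
b = matvec(M1, y)

# Second construction: y2^T M2 = 0, and x yields one exponentially large margin.
M2 = second_matrix(m)
y2 = [2 ** (m - 2 - i) for i in range(m - 1)] + [1]   # (2^{m-2}, ..., 1, 1)
left_null = [sum(y2[i] * M2[i][j] for i in range(m)) for j in range(m - 1)]
x = [2 ** j for j in range(m - 1)]                    # (1, 2, 4, ..., 2^{m-2})
margins = matvec(M2, x)

eps = 1.0 / (2 * m)
total_loss = sum(math.exp(-eps * g) for g in margins) # unnormalized loss of eps * x
largest_margin = eps * margins[-1]
print(b, left_null, margins[-1], total_loss, largest_margin)
```

For $m = 12$ the scaled vector $\varepsilon x$ keeps the unnormalized loss below $m$ while the last margin is $(2^{11} - 1)/24$, matching the $\Omega(2^m/m)$ growth of $\mu_{\max}$.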

The above result implies any bound on $C$ derived from Corollary 23 will be at least $2^{\Omega(2^m/m)}$
in the worst case. This does not imply that the best bound one can hope to prove is doubly
exponential, only that our techniques in the previous section do not admit anything better.
We next show that the bounds in Lemma 24 are nearly the worst possible.

Lemma 25 Suppose each entry of $M$ is $-1$, $0$ or $+1$. Then each of the quantities $\lambda_{\min}^{-1}$, $\gamma^{-1}$
and $\mu_{\max}$ is at most $2^{O(m \ln m)}$.

The proof of Lemma 25 is rather technical, and we defer it to the Appendix. Lemma 25 and
Corollary 23 together imply a convergence rate of $2^{2^{O(m \ln m)}}/\varepsilon$ to the optimal loss for integer
matrices. This bound on $C$ is exponentially worse than the $\Omega(2^m)$ lower bound on $C$ we saw
in Section 3.3, a price we pay for obtaining optimal dependence on $\varepsilon$. In the next section
we will see how to obtain $\mathrm{poly}(2^{m \ln m}, \varepsilon^{-1})$ bounds, although with a worse dependence on
$\varepsilon$. We end this section by showing, just for completeness, how a bound on the norm of $\eta^*$
as defined in Item 2 of the decomposition lemma follows as a quick corollary to Lemma 25.



Corollary 26 Suppose $\eta^*$ is as given by Item 2 of the decomposition lemma. When the
feature matrix has only $-1, 0, +1$ entries, we may bound $\|\eta^*\|_1 \le 2^{O(m \ln m)}$.

Proof Note that every entry of $M_F\eta^*$ lies in the range $[-\ln m, \mu_{\max} = 2^{O(m \ln m)}]$, and
hence $\|M_F\eta^*\| \le 2^{O(m \ln m)}$. Next, we may choose $\eta^*$ orthogonal to the null space of $M_F$;
then $\|\eta^*\| \le \lambda_{\min}^{-1}\|M_F\eta^*\| \le 2^{O(m \ln m)}$. Since $\|\eta^*\|_1 \le \sqrt{N}\|\eta^*\|$, and the number of
possible columns with $\{-1, 0, +1\}$ entries is at most $3^m$, the proof follows. ■



5. Improved Estimates 

In this section we shed more light on the rate bounds by cross-application of techniques
from Sections 3 and 4. We obtain both new upper bounds for convergence to the optimal
loss, as well as lower bounds for convergence to an arbitrary target loss. We also indicate
what we believe might be the optimal bounds for either situation.

We first show how the finite rate bound of Theorem 1 along with the decomposition
lemma yields a new rate of convergence to the optimal loss. Although the dependence on
$\varepsilon$ is worse than in Theorem 14, the dependence on $m$ is nearly optimal. We will need the
following key application of the decomposition lemma.

Lemma 27 When the feature matrix has $-1, 0, +1$ entries, for any $\varepsilon > 0$, there is some
solution with $\ell_1$-norm at most $2^{O(m \ln m)}\ln(1/\varepsilon)$ that achieves within $\varepsilon$ of the optimal loss.

Proof Let $\eta^*, \eta^\dagger, \gamma$ be as given by the decomposition lemma. Let $c = \min_{i \in Z}(M\eta^*)_i$ be
the minimum margin produced by $\eta^*$ on any example in the zero-loss set $Z$. Then $\eta^* - c\eta^\dagger$
produces non-negative margins on $Z$, and the optimal margins on the finite loss set $F$.
Therefore, the vector $\lambda^* = \eta^* + (\ln(1/\varepsilon)\gamma^{-1} - c)\,\eta^\dagger$ achieves at least $\ln(1/\varepsilon)$ margin on every
example in $Z$, and optimal margins on the finite loss set $F$. Hence $L(\lambda^*) \le \inf_\lambda L(\lambda) + \varepsilon$.
Using $|c| \le \|M\eta^*\| \le m\|\eta^*\|$, and the results in Corollary 26 and Lemma 25, we may
conclude the vector $\lambda^*$ has $\ell_1$-norm at most $2^{O(m \ln m)}\ln(1/\varepsilon)$. ■

We may now invoke Theorem 1 to obtain a $2^{O(m \ln m)}\,\mathrm{poly}(\ln(1/\varepsilon), \varepsilon^{-1})$ rate of convergence to the
optimal solution. Rate bounds with similar dependence on $m$ and slightly better dependence
on $\varepsilon$ can be obtained by modifying the proof in Section 4 to use first order instead of second
order techniques. In that way we may obtain a $\mathrm{poly}(\lambda_{\min}^{-1}, \gamma^{-1}, \mu_{\max}) \cdot \mathrm{poly}(\varepsilon^{-1})$
rate bound. We omit the rather long but straightforward proof of this fact. Finally,
note that if Conjecture 6 is true, then Lemma 27 implies a $2^{O(m \ln m)}\ln(1/\varepsilon)\varepsilon^{-1}$ rate bound
for converging to the optimal loss, which is nearly optimal in both $m$ and $\varepsilon$. We state this
as an independent conjecture.

Conjecture 28 For feature matrices with $-1, 0, +1$ entries, AdaBoost converges to within
$\varepsilon$ of the optimal loss within $2^{O(m \ln m)}\varepsilon^{-(1+o(1))}$ rounds.

We next focus on lower bounds on the convergence rate to arbitrary target losses
discussed in Section 3. We begin by showing the rate dependence on the norm of the solution
as given in Lemma 9 holds for much more general datasets.

Lemma 29 Suppose a feature matrix has only ±1 entries, and the finite loss set is non- 
empty. Then, for any coordinate descent procedure, the number of rounds required to achieve 
a target loss (p* is at least 

inf {||A||i : L(A) < 0*} /(I + Inm). 

Proof It suffices to upper-bound the step size $|\alpha_t|$ in any round $t$ by $1 + \ln m$.
Notice that when the feature matrix has $\pm 1$ entries, a step in a direction that does not end
up increasing the loss is at most of length $\frac{1}{2} \ln\left((1 + \delta)/(1 - \delta)\right)$, where $\delta$ is the edge in
that direction. Therefore, if $\delta_t$ is the maximum edge achievable in any direction, we have

$$|\alpha_t| \le \frac{1}{2} \ln\left(\frac{1 + \delta_t}{1 - \delta_t}\right).$$

Further, by the earlier bound relating the edge to the per-round decrease of the loss, a large edge $\delta_t$ ensures that for some coordinate step, the new vector
$\lambda^t$ will have much smaller loss than the vector $\lambda^{t-1}$ at the beginning of round $t$:
$L(\lambda^t) \le L(\lambda^{t-1})\sqrt{1 - \delta_t^2}$. On the other hand, before the step, the loss is at most 1, $L(\lambda^{t-1}) \le 1$, and
after the step the loss is at least $1/m$ (since the optimal loss on a dataset with non-empty
finite loss set is at least $1/m$): $L(\lambda^t) \ge 1/m$. Combining these inequalities we get

$$\frac{1}{m} \le L(\lambda^t) \le L(\lambda^{t-1})\sqrt{1 - \delta_t^2} \le \sqrt{1 - \delta_t^2},$$

that is, $\sqrt{1 - \delta_t^2} \ge 1/m$. Now the step length can be bounded as

$$|\alpha_t| \le \frac{1}{2} \ln\left(\frac{1 + \delta_t}{1 - \delta_t}\right) = \ln(1 + \delta_t) - \frac{1}{2}\ln(1 - \delta_t^2) \le \delta_t + \ln m \le 1 + \ln m.$$ ■
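
The chain of inequalities bounding the step length can be sanity-checked numerically. The following Python sketch is an illustration only and is not part of the original analysis; the function names `step_length`, `max_admissible_edge` and `bound_holds` are hypothetical. It sweeps over all edges $\delta$ compatible with $\sqrt{1 - \delta^2} \ge 1/m$ and checks both the algebraic identity and the bound $|\alpha_t| \le \delta_t + \ln m$.

```python
import math


def step_length(delta):
    """Longest non-increasing step for a +/-1 column with edge delta."""
    return 0.5 * math.log((1 + delta) / (1 - delta))


def max_admissible_edge(m):
    """Largest edge compatible with sqrt(1 - delta^2) >= 1/m."""
    return math.sqrt(1 - 1 / m ** 2)


def bound_holds(m, grid=1000):
    """Check the identity and the bound on a grid of admissible edges."""
    top = max_admissible_edge(m)
    for k in range(grid + 1):
        delta = top * k / grid
        alpha = step_length(delta)
        # Identity used in the proof: split the logarithm into two parts.
        split = math.log(1 + delta) - 0.5 * math.log(1 - delta * delta)
        if abs(alpha - split) > 1e-9:
            return False
        # Bound used in the proof: ln(1 + delta) <= delta, second part <= ln m.
        if alpha > delta + math.log(m) + 1e-12:
            return False
    return True
```

For example, `bound_holds(10)` sweeps the admissible edges for a dataset with ten examples. The check is restricted to moderate $m$ because the floating-point evaluation of $1 - \delta$ loses precision as $\delta$ approaches 1.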



We end by showing a new lower bound for the convergence rate to an arbitrary target loss
studied in Section 3. Corollary 11 implies that the rate bound in Theorem 1 has to be
at least polynomially large in the norm of the solution. We now show that a polynomial
dependence on $1/\varepsilon$ in the rate is unavoidable too. This shows that rates for competing with
a finite solution are different from rates on a dataset where the optimum loss is achieved by
a finite solution, since in the latter we may achieve an $O(\ln(1/\varepsilon))$ rate.
Corollary 30 Consider any dataset (e.g. the one analyzed in Lemma 31) for which $\Omega(1/\varepsilon)$ rounds
are necessary to get within $\varepsilon$ of the optimal loss. If there are constants $c$ and $\beta$ such that
for any $\lambda^*$ and $\varepsilon$, a loss of $L(\lambda^*) + \varepsilon$ can be achieved in at most $O(\|\lambda^*\|_1^{c}\,\varepsilon^{-\beta})$ rounds, then

$$\beta \ge 1.$$

Proof The decomposition lemma implies that $\lambda^* = \eta^* + \ln(2/\varepsilon)\,\eta^\dagger$ with $\ell_1$-norm $O(\ln(1/\varepsilon))$
achieves loss at most $K + \varepsilon/2$ (recall $K$ is the optimal loss). Suppose the corollary fails
to hold for constants $c$ and $\beta < 1$. Then a loss of $L(\lambda^*) + \varepsilon/2 \le K + \varepsilon$ can be achieved in
$O(\varepsilon^{-\beta} \ln^{c}(1/\varepsilon)) = o(1/\varepsilon)$ rounds, contradicting the $\Omega(1/\varepsilon)$ lower bound. ■



6. Conclusion 

In this paper we studied the convergence rate of AdaBoost with respect to the exponential
loss. We showed upper and lower bounds for convergence rates both to an arbitrary target
loss achieved by some finite combination of the weak hypotheses, and to the infimum
loss, which may not be realizable. For the first convergence rate, we showed that a strong
relationship exists between the size of the minimum vector achieving a target loss and the
number of rounds of coordinate descent required to achieve that loss. In particular, we
showed that a polynomial dependence of the rate on the $\ell_1$-norm $B$ of the minimum size
solution is necessary, and that a $\mathrm{poly}(B, 1/\varepsilon)$ upper bound holds, where $\varepsilon$ is the
accuracy parameter. The actual rate we derive has rather large exponents, and we discuss
a minor variant of AdaBoost that achieves a much tighter and near optimal rate.

For the second kind of convergence, using entirely separate techniques, we derived a $C/\varepsilon$
upper bound, and showed that this is tight up to constant factors. In the process, we showed
a certain decomposition lemma that might be of independent interest. We also studied the
constants and showed how they depend on certain intrinsic parameters related to the singular
values of the feature matrix. We estimated the worst case values of these parameters;
for feature matrices with only $\{-1, 0, +1\}$ entries, this leads to a bound on the rate
constant $C$ that is doubly exponential in the number of training examples. Since this is
rather large, we also included bounds polynomial in both the number of training examples
and the accuracy parameter $\varepsilon$, although the dependence on $\varepsilon$ in these bounds is non-optimal.

Finally, for each kind of convergence, we conjecture tighter bounds that are not known
to hold presently. A table containing a summary of the results in this paper is included in
Figure 5.
Convergence rate with respect to: | Reference solution (Section 3) | Optimal solution (Section 4)

Upper bounds: | see Theorem 1 | $\mathrm{poly}(e^{\mu_{\max}}, \lambda_{\min}^{-1}, \gamma^{-1})/\varepsilon \le 2^{2^{O(m \ln m)}}/\varepsilon$, and $\mathrm{poly}(\mu_{\max}, \lambda_{\min}^{-1}, \gamma^{-1})/\varepsilon^{3} \le 2^{O(m \ln m)}/\varepsilon^{3}$

Lower bounds with a) $\{0, \pm 1\}$ entries: | $(B/\varepsilon)^{1-\nu}$ for any constant $\nu$, and $(2^{m}/\ln m)\ln(1/\varepsilon)$ | $\max\left\{ \frac{2^{m}\ln(1/\varepsilon)}{\ln m}, \frac{2}{9\varepsilon} \right\}$

Lower bounds with b) real entries: | Can be arbitrarily large even when $m$, $\varepsilon$ are held fixed (both columns)

Conjectured upper bounds: | see Conjecture 6 | $2^{O(m \ln m)}/\varepsilon^{1+o(1)}$, if entries in $\{0, \pm 1\}$

Figure 5: Summary of our most important results and conjectures regarding the convergence rate
of AdaBoost. Here $m$ refers to the number of training examples, and $\varepsilon$ is the accuracy parameter.
The quantity $B$ is the $\ell_1$-norm of the reference solution used in Section 3. The parameters $\lambda_{\min}$, $\gamma$
and $\mu_{\max}$ depend on the dataset and are defined and studied in Section 4.



Appendix 

Lemma 31 For any $\varepsilon < 1/3$, to get within $\varepsilon$ of the optimum loss on the three-example dataset
(examples $a$, $b$, $c$ and weak hypotheses $h_a$, $h_b$) tabulated earlier, AdaBoost takes at least $2/(9\varepsilon)$ steps.

Proof Note that the optimal loss is $2/3$, and we are bounding the number of rounds
necessary to reach loss at most $(2/3) + \varepsilon$ for $\varepsilon < 1/3$. We will compute the edge in each round
analytically. Let $w_a^t, w_b^t, w_c^t$ denote the normalized losses (adding up to 1), or weights, on
examples $a, b, c$ at the beginning of round $t$, $h_t$ the weak hypothesis chosen in round $t$, and
$\delta_t$ the edge in round $t$. The values of these parameters are shown below for the first 5
rounds, where we have assumed (without loss of generality) that the hypothesis picked in
round 1 is $h_b$.



Round    $w_a^t$   $w_b^t$   $w_c^t$   $h_t$    $\delta_t$
t = 1:   1/3       1/3       1/3       $h_b$    1/3
t = 2:   1/2       1/4       1/4       $h_a$    1/2
t = 3:   1/3       1/2       1/6       $h_b$    1/3
t = 4:   1/2       3/8       1/8       $h_a$    1/4
t = 5:   2/5       1/2       1/10      $h_b$    1/5



Based on the patterns above, we first claim that for rounds $t \ge 2$, the edge achieved is $1/t$.
In fact we prove the stronger claims that, for rounds $t \ge 2$, the following hold:

1. One of $w_a^t$ and $w_b^t$ is $1/2$.

2. $\delta_{t+1} = \delta_t/(1 + \delta_t)$.

Since $\delta_2 = 1/2$, the recurrence on $\delta_t$ would immediately imply $\delta_t = 1/t$ for $t \ge 2$. We prove
the stronger claims by induction on the round $t$. The base case for $t = 2$ is shown above
and may be verified. Suppose the inductive assumption holds for $t$. Assume without loss
of generality that $1/2 = w_a^t \ge w_b^t \ge w_c^t$; note this implies $w_b^t = 1 - (w_a^t + w_c^t) = 1/2 - w_c^t$.
Further, in this round, $h_a$ gets picked, and has edge $\delta_t = w_a^t + w_c^t - w_b^t = 2w_c^t$. Now for any
dataset, the weights of the examples labeled correctly and incorrectly in a round of AdaBoost
are rescaled during the weight update step in a way such that each add up to $1/2$ after the
rescaling. Therefore, $w_b^{t+1} = 1/2$ and $w_c^{t+1} = w_c^t \cdot \frac{1/2}{w_a^t + w_c^t} = w_c^t/(1 + 2w_c^t)$. Hence, $h_b$ gets picked
in round $t + 1$ and, as before, we get edge $\delta_{t+1} = 2w_c^{t+1} = 2w_c^t/(1 + 2w_c^t) = \delta_t/(1 + \delta_t)$. The
proof of our claim follows by induction.

Next we find the loss after each iteration. Using $\delta_1 = 1/3$ and $\delta_t = 1/t$ for $t \ge 2$, the
loss after $T$ rounds can be written as

$$\prod_{t=1}^{T} \sqrt{1 - \delta_t^2} = \sqrt{1 - (1/3)^2}\, \prod_{t=2}^{T} \sqrt{1 - 1/t^2} = \frac{2\sqrt{2}}{3} \sqrt{\prod_{t=2}^{T} \frac{t-1}{t} \cdot \frac{t+1}{t}}.$$

The product can be rewritten as follows:

$$\prod_{t=2}^{T} \frac{t-1}{t} \cdot \frac{t+1}{t} = \left(\prod_{t=2}^{T} \frac{t-1}{t}\right)\left(\prod_{t=2}^{T} \frac{t+1}{t}\right).$$

Notice almost all the terms cancel, except for the first term of the first product, and the
last term of the second product. Therefore, the loss after $T$ rounds is

$$\frac{2\sqrt{2}}{3}\sqrt{\frac{1}{2} \cdot \frac{T+1}{T}} = \frac{2}{3}\sqrt{1 + \frac{1}{T}} \ge \frac{2}{3}\left(1 + \frac{1}{3T}\right) = \frac{2}{3} + \frac{2}{9T},$$

where the inequality holds for $T \ge 1$. Since the initial error is $1 = (2/3) + 1/3$,
for any $\varepsilon < 1/3$ the number of rounds needed to achieve loss $(2/3) + \varepsilon$ is at least $2/(9\varepsilon)$. ■
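
The edge and loss sequences derived in this proof can be reproduced with a short simulation. The Python sketch below is illustrative rather than the authors' code. It assumes a feature matrix inferred from the weight table in the proof (rows are the examples $a, b, c$, columns are the hypotheses $h_a, h_b$, with $h_a$ wrong only on $b$ and $h_b$ wrong only on $a$); the names `FEATURES` and `run_adaboost` are hypothetical. Ties in round 1 are broken in favor of $h_a$ rather than $h_b$, which by symmetry does not change the edges or the losses.

```python
import math

# Assumed feature matrix, inferred from the weight table in the proof:
# rows are examples a, b, c; columns are hypotheses h_a, h_b.
# Entry +1 means the hypothesis labels the example correctly, -1 incorrectly.
FEATURES = [
    [+1, -1],  # example a: h_a correct, h_b wrong
    [-1, +1],  # example b: h_a wrong, h_b correct
    [+1, +1],  # example c: both correct
]


def run_adaboost(features, rounds):
    """Run AdaBoost with exact line search; return per-round edges and losses."""
    m, n = len(features), len(features[0])
    margins = [0.0] * m
    edges, losses = [], []
    for _ in range(rounds):
        # Normalized exponential losses act as the example weights.
        raw = [math.exp(-g) for g in margins]
        total = sum(raw)
        weights = [r / total for r in raw]
        # Edge of hypothesis j: weighted sum of its +/-1 entries.
        col_edges = [sum(weights[i] * features[i][j] for i in range(m))
                     for j in range(n)]
        best = max(range(n), key=lambda j: abs(col_edges[j]))
        delta = col_edges[best]
        # Step that exactly minimizes the loss along a +/-1 valued column.
        alpha = 0.5 * math.log((1 + delta) / (1 - delta))
        margins = [margins[i] + alpha * features[i][best] for i in range(m)]
        edges.append(abs(delta))
        losses.append(sum(math.exp(-g) for g in margins) / m)
    return edges, losses
```

Calling `edges, losses = run_adaboost(FEATURES, 50)` and comparing `losses[T - 1]` with $(2/3)\sqrt{1 + 1/T}$ gives a direct check of the closed form above, up to floating-point error.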



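Lemma 32 Suppose $u_0, u_1, \ldots$ are non-negative numbers satisfying

$$u_t - u_{t+1} \ge c_0 u_t^{1+c_1},$$

for some non-negative constants $c_0, c_1$. Then, for any $t$,

$$\frac{1}{u_t^{c_1}} - \frac{1}{u_0^{c_1}} \ge c_1 c_0 t.$$

Proof By induction on $t$. The base case is an identity. Assume the statement holds at
iteration $t$. Then,

$$\frac{1}{u_{t+1}^{c_1}} - \frac{1}{u_0^{c_1}} = \left(\frac{1}{u_{t+1}^{c_1}} - \frac{1}{u_t^{c_1}}\right) + \left(\frac{1}{u_t^{c_1}} - \frac{1}{u_0^{c_1}}\right) \ge \frac{1}{u_{t+1}^{c_1}} - \frac{1}{u_t^{c_1}} + c_1 c_0 t \quad \text{(by inductive hypothesis)}.$$

Thus it suffices to show $1/u_{t+1}^{c_1} - 1/u_t^{c_1} \ge c_1 c_0$. Multiplying both sides by $u_t^{c_1}$ and adding
1, this is equivalent to showing $(u_t/u_{t+1})^{c_1} \ge 1 + c_1 c_0 u_t^{c_1}$. We will in fact show the stronger
inequality

$$(u_t/u_{t+1})^{c_1} \ge (1 + c_0 u_t^{c_1})^{c_1}. \qquad (9)$$

Since $(1 + a)^b \ge 1 + ba$ for $a, b$ non-negative, (9) will imply $(u_t/u_{t+1})^{c_1} \ge (1 + c_0 u_t^{c_1})^{c_1} \ge
1 + c_1 c_0 u_t^{c_1}$, which will complete our proof. To show (9), we first rearrange the condition on
$u_t, u_{t+1}$ to obtain

$$u_{t+1} \le u_t\left(1 - c_0 u_t^{c_1}\right) \implies \frac{u_t}{u_{t+1}} \ge \frac{1}{1 - c_0 u_t^{c_1}}.$$

Applying the fact $(1 + c_0 u_t^{c_1})(1 - c_0 u_t^{c_1}) \le 1$ to the previous equation we get

$$\frac{u_t}{u_{t+1}} \ge 1 + c_0 u_t^{c_1}.$$

Since $c_1 \ge 0$, we may raise both sides of the above inequality to the power of $c_1$ to show
(9), finishing our proof. ■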
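
The statement of Lemma 32 can be checked numerically on the extremal sequence that satisfies the hypothesis with equality. The Python sketch below is an illustration under assumed parameter values ($u_0 = 1$ and a few choices of $c_0, c_1$ that keep the sequence positive); the helper names `extremal_sequence` and `potential_bound_holds` are hypothetical.

```python
def extremal_sequence(u0, c0, c1, steps):
    """Sequence with u_t - u_{t+1} = c0 * u_t^(1 + c1), i.e. the slowest decrease
    allowed by the hypothesis of the lemma."""
    seq = [u0]
    for _ in range(steps):
        u = seq[-1]
        seq.append(u - c0 * u ** (1 + c1))
    return seq


def potential_bound_holds(seq, c0, c1, tol=1e-9):
    """Check 1/u_t^c1 - 1/u_0^c1 >= c1 * c0 * t for every index t."""
    u0 = seq[0]
    return all(
        1 / u ** c1 - 1 / u0 ** c1 >= c1 * c0 * t - tol
        for t, u in enumerate(seq)
    )
```

For instance, with $c_0 = 1/2$ and $c_1 = 1$ the sequence starts $1, 1/2, 3/8, \ldots$, and the quantity $1/u_t - 1/u_0$ grows at least as fast as $t/2$.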



Proof of Lemma 25

In this section we prove Lemma 25 by separately bounding the quantities $\lambda_{\min}^{-1}$, $\gamma^{-1}$ and
$\mu_{\max}$, through a sequence of lemmas. We will use the next result repeatedly.

Lemma 33 If $A$ is an $n \times n$ invertible matrix with $-1, 0, +1$ entries, then $\min_{x : \|x\| = 1} \|Ax\|$
is at least $1/n! = 2^{-O(n \ln n)}$.

Proof It suffices to show that $\|A^{-1}x\| \le n!$ for any $x$ with unit norm. Now $A^{-1} =
\mathrm{adj}(A)/\det(A)$, where $\mathrm{adj}(A)$ is the classical adjoint (adjugate) of $A$, whose $(i,j)$-th entry is the $(j,i)$-th cofactor of
$A$ (the $(i,j)$-th cofactor being $(-1)^{i+j}$ times the determinant of the $(n-1) \times (n-1)$ matrix obtained by removing
the $i$th row and $j$th column of $A$), and $\det(A)$ is the determinant of $A$. The determinant
of any $k \times k$ matrix $G$ can be written as $\sum_{\sigma} \mathrm{sgn}(\sigma) \prod_{i=1}^{k} G(i, \sigma(i))$, where $\sigma$ ranges over all
the permutations of $1, \ldots, k$. Therefore each entry of $\mathrm{adj}(A)$ is at most $(n-1)!$ in absolute value, and
$\det(A)$ is a non-zero integer. Therefore $\|A^{-1}x\| = \|\mathrm{adj}(A)x\|/|\det(A)| \le n!\,\|x\|$, and the
proof is complete. ■
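
For small $n$ the bound of Lemma 33 can be verified exhaustively in exact integer arithmetic. The Python sketch below is an illustration, not part of the original proof; the function names are hypothetical. It uses the permutation expansion of the determinant from the proof, and checks the slightly stronger Frobenius-norm statement $\|A^{-1}\|_F \le n!$, which implies $\|A^{-1}x\| \le n!\,\|x\|$ because the spectral norm is at most the Frobenius norm.

```python
from itertools import permutations, product
from math import factorial


def det(mat):
    """Determinant via the permutation expansion (exact for integer input)."""
    k = len(mat)
    total = 0
    for perm in permutations(range(k)):
        # Sign of the permutation is the parity of its inversion count.
        inversions = sum(perm[i] > perm[j]
                         for i in range(k) for j in range(i + 1, k))
        term = -1 if inversions % 2 else 1
        for row, col in enumerate(perm):
            term *= mat[row][col]
        total += term
    return total


def cofactor_square_sum(mat):
    """Sum of squared cofactors, i.e. the squared Frobenius norm of adj(A)."""
    n = len(mat)
    total = 0
    for i in range(n):
        for j in range(n):
            minor = [[mat[r][c] for c in range(n) if c != j]
                     for r in range(n) if r != i]
            total += det(minor) ** 2
    return total


def inverse_norm_bound_holds(n):
    """Check ||A^{-1}||_F <= n! for every invertible n x n {-1,0,+1} matrix."""
    for entries in product((-1, 0, 1), repeat=n * n):
        mat = [list(entries[r * n:(r + 1) * n]) for r in range(n)]
        d = det(mat)
        if d == 0:
            continue  # singular matrices are outside the scope of the lemma
        # ||A^{-1}||_F^2 = ||adj(A)||_F^2 / det(A)^2, compared without division.
        if cofactor_square_sum(mat) > factorial(n) ** 2 * d * d:
            return False
    return True
```

The enumeration grows as $3^{n^2}$, so the check is only practical for $n \le 3$; it is meant as a sanity check of the argument, not as evidence about the tightness of the $1/n!$ bound.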

We first show our bound holds for $\lambda_{\min}$.

Lemma 34 Suppose $M$ has $-1, 0, +1$ entries, and let $M_F$, $\lambda_{\min}$ be as defined in Section 4.
Then $\lambda_{\min} \ge 1/m!$.

Proof Let $A$ denote the matrix $M_F$. It suffices to show that $A$ does not shrink the norm of
any vector orthogonal to the null-space $\ker A = \{\eta : A\eta = 0\}$ of $A$ by too much, i.e.
$\|A\lambda\| \ge (1/m!)\|\lambda\|$ for any $\lambda \in \ker A^{\perp}$. We first characterize $\ker A^{\perp}$ and then study how
$A$ acts on this subspace.

Let the rank of $A$ be $k \le m$ (notice $A = M_F$ has $N$ columns and fewer than $m$ rows).
Without loss of generality, assume the first $k$ columns of $A$ are independent. Then every
column of $A$ can be written as a linear combination of the first $k$ columns of $A$, and we
have $A = A'[I|B]$ (that is, the matrix $A$ is the product of matrices $A'$ and $[I|B]$), where $A'$
is the submatrix consisting of the first $k$ columns of $A$, $I$ is the $k \times k$ identity matrix, and
$B$ is some $k \times (N - k)$ matrix of linear combinations (here $|$ denotes concatenation). The
null-space of $A$ consists of $x$ such that $0 = Ax = A'[I|B]x = A'(x_k + Bx_{-k})$, where $x_k$ is
the first $k$ coordinates of $x$, and $x_{-k}$ the remaining $N - k$ coordinates. Since the columns
of $A'$ are independent, this happens if and only if $x_k = -Bx_{-k}$. Therefore $\ker A =
\{(-Bz, z) : z \in \mathbb{R}^{N-k}\}$. Since a vector $x$ lies in the orthogonal subspace of $\ker A$ if it is
orthogonal to every vector in the latter, we have

$$\ker A^{\perp} = \left\{(x_k, x_{-k}) : \langle x_k, Bz \rangle = \langle x_{-k}, z \rangle, \ \forall z \in \mathbb{R}^{N-k}\right\}.$$

We next see how $A$ acts on this subspace. Recall $A = A'[I|B]$ where $A'$ has $k$ independent
columns. By basic linear algebra, the row rank of $A'$ is also $k$, and assume without loss of
generality that the first $k$ rows of $A'$ are independent. Denote by $A_k$ the $k \times k$ submatrix
of $A'$ formed by these $k$ rows. Then for any vector $x$,

$$\|Ax\| = \|A'[I|B]x\| = \|A'(x_k + Bx_{-k})\| \ge \|A_k(x_k + Bx_{-k})\| \ge \frac{1}{k!}\|x_k + Bx_{-k}\|,$$

where the last inequality follows from Lemma 33. Since $k \le m$, we have $1/k! \ge 1/m!$. To finish the proof, it therefore suffices to show
that $\|x_k + Bx_{-k}\| \ge \|x\|$ for $x \in \ker A^{\perp}$. Indeed, by expanding out $\|x_k + Bx_{-k}\|^2$ as inner
product with itself, we have

$$\|x_k + Bx_{-k}\|^2 = \|x_k\|^2 + \|Bx_{-k}\|^2 + 2\langle x_k, Bx_{-k} \rangle \ge \|x_k\|^2 + 2\|x_{-k}\|^2 \ge \|x\|^2,$$

where the first inequality follows since $x \in \ker A^{\perp}$ implies $\langle x_k, Bx_{-k} \rangle = \langle x_{-k}, x_{-k} \rangle$. ■

To show the bounds on $\gamma^{-1}$ and $\mu_{\max}$, we will need an intermediate result.

Lemma 35 Suppose $A$ is a matrix, and $b$ a vector, both with $-1, 0, 1$ entries. If $Ax =
b, x \ge 0$ is solvable, then there is a solution satisfying $\|x\| \le k \cdot k!$, where $k = \mathrm{rank}(A)$.

Proof Pick a solution $x$ with maximum number of zeroes. Let $J$ be the set of coordinates
$i$ for which $x_i$ is zero. We first claim that there is no other solution $x'$ which is also zero on
the set $J$. Suppose there were such an $x'$. Note any point $p$ on the infinite line joining $x, x'$
satisfies $Ap = b$, and $p_J = 0$ (that is, $p_{i'} = 0$ for $i' \in J$). If $i$ is any coordinate not in $J$
such that $x_i \ne x'_i$, then for some point $p^{(i)}$ along the line, we have $p^{(i)}_i = 0$. Choose $i$ so
that $p^{(i)}$ is as close to $x$ as possible. Since $x \ge 0$, by continuity this would also imply that
$p^{(i)} \ge 0$. But then $p^{(i)}$ is a solution with more zeroes than $x$, a contradiction.

The claim implies that the reduced problem $A'\tilde{x} = b, \tilde{x} \ge 0$, obtained by substituting
$x_J = 0$, has a unique solution. Let $k = \mathrm{rank}(A')$, $A_k$ be a $k \times k$ submatrix of $A'$ with full
rank, and $b_k$ be the restriction of $b$ to the rows corresponding to those of $A_k$ (note that $A'$,
and hence $A_k$, contain only $-1, 0, +1$ entries). Then, $A_k\tilde{x} = b_k, \tilde{x} \ge 0$ is equivalent to the
reduced problem. In particular, by uniqueness, solving $A_k\tilde{x} = b_k$ automatically ensures
the obtained $x = (\tilde{x}, 0_J)$ is a non-negative solution to the original problem, and satisfies
$\|x\| = \|\tilde{x}\|$. But, by Lemma 33,

$$\|\tilde{x}\| \le k!\,\|A_k\tilde{x}\| = k!\,\|b_k\| \le k \cdot k!.$$ ■



The bound on $\gamma$ follows easily.

Lemma 36 Let $\gamma, \eta^\dagger$ be as in Item 1 of Lemma 15. Then $\eta^\dagger$ can be chosen such that
$\gamma \ge 1/(\sqrt{N}\, m \cdot m!) \ge 2^{-O(m \ln m)}$.

Proof We know that $M(\eta^\dagger/\gamma) = b$, where $b$ is zero on the set $F$ and at least 1 for every
example in the zero loss set $Z$ (as given by Item 1 of Lemma 15). Since $M$ is closed under
complementing columns, we may assume in addition that $\eta^\dagger \ge 0$. Introduce slack variables
$z_i$ for $i \in Z$, and let $\tilde{M}$ be $M$ augmented with the columns $-e_i$ for $i \in Z$, where $e_i$ is
the standard basis vector with 1 on the $i$th coordinate and zero everywhere else. Then,
by setting $z = M(\eta^\dagger/\gamma) - b$, we have a solution $(\eta^\dagger/\gamma, z)$ to the system $\tilde{M}x = b, x \ge 0$.
Applying Lemma 35, we know there exists some solution $(y, z')$ with norm at most $m \cdot m!$
(here $z'$ corresponds to the slack variables). Observe that $y/\|y\|_1$ is a valid choice for $\eta^\dagger$
yielding a $\gamma$ of $1/\|y\|_1 \ge 1/(\sqrt{N}\, m \cdot m!)$. ■

To show the bound for $\mu_{\max}$ we will need a version of Lemma 35 with strict inequality.

Corollary 37 Suppose $A$ is a matrix, and $b$ a vector, both with $-1, 0, 1$ entries. If $Ax =
b, x > 0$ is solvable, then there is a solution satisfying $\|x\| \le 1 + k \cdot k!$, where $k = \mathrm{rank}(A)$.

Proof Using Lemma 35, pick a solution $x$ to $Ax = b, x \ge 0$ with norm at most $k \cdot k!$. If
$x > 0$, then we are done. Otherwise let $y > 0$ satisfy $Ay = b$, and consider the segment
joining $x$ and $y$. Every point $p$ on the segment satisfies $Ap = b$. Further, any coordinate
becomes zero at most once on the segment. Therefore, there are points arbitrarily close to
$x$ on the segment with positive coordinates that satisfy the equation, and these have norms
approaching that of $x$. ■

We next characterize the feature matrix restricted to the finite-loss examples, which
might be of independent interest.

Lemma 38 If $M_F$ is the feature matrix restricted to the finite-loss examples $F$ (as given
by Lemma 15), then there exists a positive linear combination $y > 0$ such that
$M_F^{T} y = 0$.

Proof Item 3 of the decomposition lemma states that whenever the loss of a vector on $F$
is bounded by $m$, then the largest margin $\max_{i \in F}(M_F x)_i$ is at most $\mu_{\max}$. This implies
that there is no vector $x$ such that $M_F x \ge 0$ and at least one of the margins $(M_F x)_i$ is
positive; otherwise, an arbitrarily large multiple of $x$ would still have loss at most $m$, but
margin exceeding the constant $\mu_{\max}$. In other words, $M_F x \ge 0$ implies $M_F x = 0$. In
particular, the subspace of possible margin vectors $\{M_F x : x \in \mathbb{R}^{N}\}$ is disjoint from the
convex set $\Delta_F$ of distributions over examples in $F$, which consists of points in $\mathbb{R}^{|F|}$ with
all non-negative and at least one positive coordinates. By the Hahn-Banach Separation
theorem, there exists a hyperplane separating these two bodies, i.e. there is a $y \in \mathbb{R}^{|F|}$
such that for any $x \in \mathbb{R}^{N}$ and $p \in \Delta_F$, we have $\langle y, M_F x \rangle \le 0 < \langle y, p \rangle$. By choosing $p = e_i$
for various $i \in F$, the second inequality yields $y > 0$. Since $M_F x = -M_F(-x)$, the first
inequality implies that equality holds for all $x$, i.e. $y^{T} M_F = 0^{T}$. ■

We can finally upper-bound $\mu_{\max}$.

Lemma 39 Let $F$, $\mu_{\max}$ be as in the decomposition lemma. Then $\mu_{\max} \le
\ln m \cdot \sqrt{|F|}\,(1 + |F| \cdot |F|!) \le 2^{O(m \ln m)}$.

Proof Pick any example $i \in F$ and any combination $\lambda$ whose loss on $F$, $\sum_{j \in F} e^{-(M\lambda)_j}$,
is at most $m$. Let $b$ be the $i$th row of $M$, and let $A^{T}$ be the matrix $M_F$ without the $i$th
row. Then Lemma 38 says that $Ay = -b$ for some positive vector $y > 0$. This implies
the margin of $\lambda$ on example $i$ is $(M\lambda)_i = -y^{T} A^{T} \lambda$. Since the loss of $\lambda$ on $F$ is at most
$m$, each margin on $F$ is at least $-\ln m$, and therefore $\max_{j \in F} (-A^{T}\lambda)_j \le \ln m$. Hence,
the margin on example $i$ can be bounded as $(M\lambda)_i = \langle y, -A^{T}\lambda \rangle \le \ln m\, \|y\|_1$. Using
Corollary 37, we can find $y$ with bounded norm, $\|y\|_1 \le \sqrt{|F|}\, \|y\| \le \sqrt{|F|}\,(1 + k \cdot k!)$,
where $k = \mathrm{rank}(A) \le \mathrm{rank}(M_F) \le |F|$. The proof follows. ■